ABSTRACT


The proposed automatic document classification system 
can be roughly divided into three phases, namely, 

1) Feature Extraction Phase, 

2) Feature Selection and Ordering Phase, 

3) Classification Phase.

The feature extraction phase is designed to extract 
the maximum possible number of features from the sample 
documents of the data base. The max-min composition
operation in fuzzy logic is used to extract the strongest
possible relations between features. To maximize the 
efficiency of the proposed classification system, a feature 
selection and ordering phase is included to determine those 
significant features that contribute most to the 
classification process. A feature selection technique 
based on the Karhunen-Loève expansion scheme is applied.
A parametric training method is used to train the document
classifier in the classification phase. Sample statistics
are collected from the sample documents of each individual
class. Use of discriminant functions based on statistical


relations between keywords and classes forms the basis of 


the classification process. 
CHAPTER I
INTRODUCTION 


1.1. Statement of the Problem. 

In the past two decades, the rate of growth of 
information concerned with scientific fields has increased 
in a manner often described as leading to an "information 
explosion" (1). A large number of scientific and technical
papers are produced each day, and the existing manual 
classification systems cannot handle the resulting large 
amount of material. A knowledge of up-to-date information 
is particularly essential in the fields of science; indeed 
it is directly responsible for the further development of 
many new technologies. With the advent of high speed 
electronic computers, scientists began to think of using 
mechanical means to substitute for the intellectual task of 
document classification. 

In Canada, the problem of the information explosion is 
increased by the shortage of well-trained librarians; only a 
few universities can produce a limited number of library 
science graduates every year. The only alternative at hand 
is to switch to automatic classification systems, thus using 
computers to remedy the shortage of qualified manual 
classifiers. 

Mechanical classification has several advantages over 
manual systems. A large store of computer accessible data 
may be stored with relative ease. Unlike humans, a computer 


can be programmed to deal with several fields at the same 
time. No human can do this with comparable efficiency.

Stability is a desirable feature in classification
systems. In the conventional manual classification systems,
the classification is almost always influenced by the 
classifier's background, attitude, and disposition. The 
quality of classification may therefore vary widely among 
classifiers. Even if the same person attempts to repeat his 
classification of a document at a later date, he may well 
produce a different resulting classification. It is obvious 
that the result of a manual classification is likely to be 
greatly biased. It is subjective and is bound to be affected 
by the classifier's own opinions and his current interests. 

The four most common manual classification schemes, 
namely, the Universal Decimal Classification (UDC) (2), 
Library of Congress Classification (LC) (2), Dewey Decimal 
Classification (DC) (2), and Colon Classification (CC) (2),
are notorious for their inefficiency in classifying highly
specialized subjects; none of them have the required
versatility to represent a complex scientific document (3).
Information scientists realize that, instead of developing 
a more sophisticated new classification scheme, they have to 
tackle the problem by an entirely new approach, and using 
a computer may well produce the required answer. 

With a computerized automatic document classification 
system, the above mentioned problems in manual classification 
may readily be tackled. The speed of a highly sophisticated 


modern electronic computer can also allow other desirable 




features. A computerized automatic classification system
offers a higher level of searching efficiency, which results
in an enormous time saving. In most libraries a surprisingly
large amount of time is required to classify each document
by manual means. Also, such classification is very expensive
since the major expense of a manual classification system is
the required salaries of the specialized personnel involved.

A computerized classification also has the advantage
of being dependable in the sense that, once an automatic
classification system has been created, the system is there
to stay and will provide the required service at all times.


1.2. Important Developments in Automatic Document 

Classification. 

H. P. Luhn first suggested a statistical approach to 
mechanize encoding and searching of literary information in 
1957 (4). He used the fact that different combinations of 
words may be used to convey the same idea. Using a 
statistical approach, Luhn suggested a statistical analysis 
of a collection of documents within a particular field of 
interest. He applied the statistical results to set up a 
thesaurus-type dictionary to be used to encode and search 
literary information. Luhn suggested that particular 
combinations of words should indicate the significant 
concepts within a document, and that the frequency of word 
occurrences should provide a useful measurement of word 
significance (5).


H. P. Luhn furthered his research in 1958 and applied 
his theory to automatic creation of literature abstracts (6).
For each document some statistical information is derived 
from word frequencies and distributions to give a measure of 
relative significance of individual words and sentences. 
Those sentences that score the highest in significance are 
extracted and combined to form the abstract. The theory is 
based on the fact that the significant factors of a sentence 
can be derived solely from an analysis of its keywords. It 
is supposed that there is only a small probability that the 
writer of a technical paper would use different words to 
reflect the same notion, and that there is only a small 
probability that he would use the same word to reflect more 
than one different notion in the same paper. 

Inspired by the work of Luhn, in 1960 M. E. Maron and 
J. L. Kuhns studied the statistical relationship of words 
within a group of documents. Their approach was based on 
measurement of the relative frequencies with which the words 
appear (7). Given a request for information, a probabilistic 
indexing scheme makes a statistical inference on documents 
in the data base, and derives a relevance number for each 
document. This number provides a measure of the probability 
that the document will satisfy the request. The result of 
the search is then an ordered list of the documents which 
satisfy the request. Documents within the list are ranked 
in importance according to their probable relevance. 

In the same year (1960), G. Salton at Harvard 


University created a retrieval system called "SMART" 




(SALTON'S MAGICAL AUTOMATIC RETRIEVAL TECHNIQUE) (8, 9).
Several hundred forms of analysis were employed to examine 
each document with a view to obtaining those words that are 
most suitable for representation of, and search on, the 
document. Some of the techniques used were statistical word 
association, syntactic analysis, statistical phrase 
recognition, and hierarchical arrangement of concepts.

A statistical approach to measure the probability of 
the relationships between words and classification categories 
was again used by Maron in 1961 (10). Maron developed a 
formula to compute an “attribute number" which gives the 
probability that a document indexed by a certain combination 
of keywords will belong to a certain class. He then attempted 
to use attribute numbers to classify documents according to 
their subject content. 

H. E. Stiles studied the possibility of constructing
an entirely automatic computer document retrieval system in
1961 (11). By application of certain statistical formulae,
Stiles proposed to calculate the degree of association 
between pairs of index words in the data base in terms of 
their frequencies of occurrence. His system matches request 
terms and the terms used to index a document according to 
their degree of association. The documents selected are 
arranged in order of their relevance to the request. 

L. B. Doyle also published a paper in 1962 (12) which 


discussed the inter-relationships between words within a


document. 
F. B. Baker viewed the problem of automatic document
classification from an entirely new angle in 1962 (13). 
Baker applied the latent class analysis, first suggested by 
Lazarsfeld (14), to document classification. The analysis is 
based on a mathematical model derived on the assumption that 
a set of data described by statistics may be divided into 
small subsets, such that in each subset the probabilities
of different word incidences are statistically independent. 

Words chosen to describe documents are not always the 
most satisfactory for document retrieval. Recognizing this, 
in 1963 G. Salton suggested use of associative document 
retrieval techniques using bibliographic information (15). 
Salton argued that documents that exhibit similar citation 
sets are likely to deal with similar subject matter. The 
addition of bibliographic information to other standard 
criteria should therefore prove to be a valuable asset in 
automatic document classification. E. Garfield also proposed
use of the notion of bibliographic links in document 
retrieval. By the aid of computer techniques, Garfield 
applied the idea of citation index in literature searching 
and he set up an important scientific literature retrieval 
service known as the Science Citation Index (16). 

During 1963 and 1964, H. Borko and M. Bernick conducted 
a series of experiments in automatic document classification 
(17, 18). Techniques such as factor analysis, Bayesian 


classification method, and factor score method were applied. 
1.3. Proposed Classification System.


The proposed classification system is a small 
computerized automatic document classification system 
suitable for a specialized scientific information centre 
with a limited amount of classification personnel. It is 
believed that the titles of scientific papers give a good 
indication of the contents of the papers, and thus keywords 
chosen from the titles are used to represent the documents. 
In order to obtain the maximum possible relationship 
between keywords, fuzzy logic is applied for feature 
extraction and to obtain indirect relationships to extend 
the possible relationships between the given keywords (19). 
The dimension of the resulting fuzzy relation matrix of 
keywords is usually very large and hence it is desirable to 
apply a dimension reduction process. By applying the 
Karhunen-Loève expansion in feature selection and ordering
(20), only those keywords that are most useful for
classification are retained. The process involves determination
of the covariance matrix for distinct keywords in the data 
base, and calculation of its corresponding eigenvalues and 
eigenvectors. The reduction in dimension is achieved through 
a linear transformation. Since the statistics of documents 
for individual classes are readily available, a statistical 
approach using a maximum likelihood discriminant function is 
employed in the classification. The theory of the proposed 
classification system is based entirely on the relationship 


between subject categories and the document content as 
indicated by the keyword statistics.
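
The idea of obtaining indirect relationships between keywords by max-min composition may be illustrated with a minimal sketch. The relation matrix `R`, its membership values, and the function names below are hypothetical and serve only as an illustration; the relation actually used in the proposed system is the one developed in Chapter III. The sketch assumes that the relation is extended by repeatedly taking the element-wise maximum of the relation and its max-min composition with itself until no entry changes.

```python
import numpy as np

def max_min_composition(r, s):
    """Max-min composition: out[i, j] = max over k of min(r[i, k], s[k, j])."""
    r = np.asarray(r, dtype=float)
    s = np.asarray(s, dtype=float)
    return np.minimum(r[:, :, None], s[None, :, :]).max(axis=1)

def extend_relation(relation, max_iter=100):
    """Repeat R <- max(R, R o R) until the relation stops changing."""
    r = np.asarray(relation, dtype=float)
    for _ in range(max_iter):
        extended = np.maximum(r, max_min_composition(r, r))
        if np.array_equal(extended, r):
            break
        r = extended
    return r

# Hypothetical relation between three keywords (illustrative values only).
# Keywords 0 and 2 have no direct relation, but both are related to keyword 1.
R = np.array([[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.6],
              [0.0, 0.6, 1.0]])

R_ext = extend_relation(R)
print(R_ext)
```

In this example the indirect relation between keywords 0 and 2 takes the strength of the weakest link on the path through keyword 1, while the direct relations are left unchanged.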

The general classification problem is, of course, 
dependent on the method of feature selection and the 
subsequent processing of the extracted features. This 
general problem is discussed in Chapter II. The use of 
fuzzy logic is described in Chapter III, and application of 
the Karhunen-Loève expansion is explained in Chapter IV.
The remaining chapters are concerned with application to 


classification of documents of a specific test data base. 
CHAPTER II
AUTOMATIC CLASSIFICATION SYSTEM 


2.1. General. 

Modern information retrieval systems may make use of 
concepts developed for pattern recognition. Machines may be 
designed to perform conceptual recognition by use of a priori 
information. A pattern recognition machine, as represented 
in Fig. 1, may be defined as a device capable of sorting or 
classifying patterns. Inputs in the form of measured values 
are fed into the machine, and outputs are produced in the 
form of predictions. The criterion of success for such a
machine is its ability to minimize the number of
misrecognitions so that the resulting forecasts are in close
agreement with the subsequently observed outcomes. Many 
theories of pattern recognition are derived from statistical 
decision theory which deals with classification of 
measurements; others result from research on the perceptron 


and adaptive decision networks. 


input                    pattern
measurements  ------>  recognition  ------>  prediction
(pattern)                machine             (response)


Fig. 1. A Pattern Recognition Machine. 


Before proceeding further, we shall briefly review 
some of the terminology commonly used in the field of
pattern recognition. According to J. T. Tou, pattern 
recognition is defined as the categorization of input data 
into identifiable classes via extraction of the significant 
features of the data from a background of irrelevant detail 
(21). A pattern is a set of data to be classified. Using 
vector notation, a pattern may be regarded as an n-dimensional
column vector whose elements represent n
different properties of a pattern. Each individual property 
of a pattern is called a feature. 

In geometric terms, the concept of pattern recognition 
can be expressed in terms of a partition of pattern space. 
A pattern may be regarded as a point in an n-dimensional
Euclidean space $E^n$ called the pattern space. Patterns that
belong to the same class correspond to an ensemble of points
distributed within some recognizable region of the space. A
pattern classifier attempts to group the pattern points of
$E^n$ into classes. If points of the same class cluster
together then decision surfaces may divide pattern space
into decision regions each of which characterizes a pattern
class. The decision regions are defined by m discriminant
functions $G_i(X)$, $i = 1, 2, \ldots, m$, where X is the feature
vector and such that for each region i there is the
inequality $G_i(X) > G_j(X)$ for $i, j = 1, 2, \ldots, m$,
$i \neq j$. The decision surface that separates region i and
region j can be expressed by the equation:

$$G_i(X) - G_j(X) = 0$$   [2.1]

Thus mathematically, the problem of pattern recognition may
be viewed as a mapping of feature measurement vectors into
proper classes.
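
The mapping of a feature vector into a class by means of discriminant functions may be sketched as follows. The linear form $G_i(X) = w_i \cdot X + b_i$ and the particular weights are assumptions made purely for illustration; the text above does not restrict the discriminant functions to any particular form. A pattern is assigned to the class whose discriminant function is largest, and a point for which two functions are equal lies on the decision surface between the two regions.

```python
import numpy as np

def discriminant_scores(x, weights, biases):
    """Evaluate the (assumed linear) discriminant functions G_i(X) = w_i . X + b_i."""
    return np.asarray(weights, dtype=float) @ np.asarray(x, dtype=float) + biases

def classify(x, weights, biases):
    """Assign X to the class i for which G_i(X) is largest."""
    return int(np.argmax(discriminant_scores(x, weights, biases)))

# Hypothetical discriminant functions for m = 3 classes in a 2-dimensional space.
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [-1.0, -1.0]])
biases = np.zeros(3)

for pattern in ([2.0, 1.0], [0.5, 3.0], [-1.0, -1.0]):
    print(pattern, "-> class", classify(pattern, weights, biases))

# A point on the decision surface between regions 0 and 1: G_0(X) - G_1(X) = 0.
on_surface = discriminant_scores([1.0, 1.0], weights, biases)
print(on_surface[0] - on_surface[1])
```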

The creation of an automatic pattern classification 
system may be roughly divided into three stages as shown in
Fig. 2. They include:

1) Feature Extraction Stage,

2) Feature Selection Stage,

and 3) Pattern Classification Stage.


Each stage is examined in detail in the following sections. 


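[Block diagram: the input (pattern) passes in turn through the
feature extraction, feature selection, and pattern
classification stages.]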


Fig. 2. A Typical Automatic Pattern Classification System. 


2.2. Feature Extraction Stage.


Feature extraction is the first stage of an automatic 
classification system. Its main purpose is to extract 
feature measurements from input patterns, and to condition 
or format the input data to a form suitable for subsequent 
analysis of the features. The input feature measurements 
should cover all the information that is available about the 
pattern; they are expressed as numerical values whose 
magnitudes indicate the amount of each feature that the 


pattern possesses. Feature measurements are best expressed 
in vector representation because this provides a geometric
interpretation of the distribution of the various patterns 
in the pattern space. Feature extraction is important 
because the performance of the entire system is dependent 

on it. Failure to extract all available data from a pattern 
will result in a corresponding loss of information for 


processing in subsequent stages. 


2.2.1. Human (or Logical) Design Technique.


Owing to lack of general techniques for the design of 
feature extraction, it is necessary to take advantage of any 
a priori knowledge that the designer may possess regarding 
selection of the important features. The technique used is 
problem dependent, and it is directly related to the 
knowledge and ingenuity of the designer. For example, in 
applications that involve time series data, the amplitudes 
of specific frequencies and the correlations between 
frequencies are suggested as possible useful features to be 
extracted. 

The importance of human designed feature extraction 
should not be underestimated. By utilizing the designer's
experience it is often possible to effect enormous savings 


in the amount of statistical analysis required in the later 


stages. 


2.2.2. Statistical Feature Extraction.

Usually, the features extracted by human designed 
feature extraction from initial data are not sufficient to 


facilitate an efficient decision making process; hence 
additional features must be extracted directly from the
initial data, from other features, or from a combination of
both. The techniques for extracting these additional
features may be classified by two criteria: the means for
choosing the subset of data or inputs from which a feature
is extracted, and the means by which the function that
represents the feature is chosen.

The statistical technique of discriminant analysis may 
readily be applied to the discovery of significant features. 
With certain assumptions about the distributions of the input 
patterns, specific functions can be implemented to generate


statistical features from the data. 


2.3. Feature Selection Stage.

A pattern which is to be recognized and classified 
should possess a number of discriminatory properties or 
features. Certainly, one can use the brute-force technique 
of measuring all possible features and then using a large 
amount of time to process the measured information. However, 
for economic reasons, it is seldom practical to use all the 
available information. One has to minimize the number of 
features examined by the classifier, and choose only those 
significant features that are most helpful to the 
classification process. There are few theories of feature 
selection. It is highly problem dependent and tends to be 
specific to each particular application. Also, human 


intuition and experience may have to be involved in many 


instances. 
Possible methods of tackling the problem are to
combine the original features, to obtain a subset from the 
original feature set, or to discard the poor features and 
emphasize the important ones. The number of features to be 
selected depends on the desired degree of accuracy in 
recognition. However, insistence on higher accuracy means 
that more features must be observed. The interset features 
that represent the differences between, or among, pattern 
classes lead to the best characterizations of the input 
patterns. The intraset features common to all patterns under 
consideration carry no discriminatory information and may be 
ignored.

Correct recognition depends on the amount of 
discriminatory information contained in the measurements. 

An insufficient number of feature measurements will not give 
a satisfactory degree of correct recognition. On the other 
hand, it is usually impractical to measure a very large 


number of features. 


2.3.1. Information Theoretic Approach. 


Approaches based on concepts of information theory 
have been suggested for evaluation of the discriminatory 
effectiveness of features. The divergence and the average 
information content of pattern classes characterized by 
features may be used as the feature selection criteria. 

The concept of divergence is closely related to the 
discriminatory power between two pattern classes that have 


Gaussian distributed feature measurements. The application 
of divergence to measure the goodness of features was first
proposed by T. Marill and D. M. Green in 1963 (22).

Assume that the pattern classes $W_i$, $i = 1, 2, \ldots, N$,
and the feature vector X are distributed according to the
multivariate Gaussian density function with mean feature
vector $\mu_i$ and covariance matrix $\Sigma_i$ for pattern class $W_i$,
$i = 1, 2, \ldots, N$. The conditional probability $P(X/W_i)$
may then be defined in matrix form as (23):

$$P(X/W_i) = (2\pi)^{-n/2} \, |\Sigma_i|^{-1/2} \exp\left[-\tfrac{1}{2}(X - \mu_i)^T \Sigma_i^{-1} (X - \mu_i)\right]$$   [2.2]

where $P(X/W_i)$ is the probability that a pattern in class $W_i$
has feature vector X, $i = 1, 2, \ldots, N$,
n is the number of elements in the feature vector,
$|\Sigma_i|$ is the determinant of $\Sigma_i$,
$(\,)^T$ denotes the transpose of the vector ( ),
and $\Sigma_i^{-1}$ is the inverse of $\Sigma_i$.


If we define the likelihood ratio of feature vector X:

$$\lambda(X) = \frac{P(X/W_i)}{P(X/W_j)}$$   [2.3]

and its logarithm

$$L(X) = \log_e \lambda(X)$$   [2.4]

then, from [2.3] and [2.4], we have

$$L(X) = \log_e P(X/W_i) - \log_e P(X/W_j).$$   [2.5]

Substituting [2.2] into [2.5] and supposing that
$\Sigma_i = \Sigma_j = \Sigma$, and $\Sigma$ is symmetrical, then




$$L(X) = -\tfrac{1}{2}(X - \mu_i)^T \Sigma^{-1} (X - \mu_i) + \tfrac{1}{2}(X - \mu_j)^T \Sigma^{-1} (X - \mu_j)$$
$$= X^T \Sigma^{-1} (\mu_i - \mu_j) - \tfrac{1}{2}(\mu_i + \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j).$$   [2.6]


Assume that the feature vector X is from class $W_i$, then
the expected value of L(X) is:

$$E[L(X)/W_i] = \tfrac{1}{2}(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)$$   [2.7]

whereas, if the feature vector X is from class $W_j$, the
corresponding expected value of L(X) is:

$$E[L(X)/W_j] = -\tfrac{1}{2}(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j).$$   [2.8]

The divergence between $W_i$ and $W_j$ is defined as:

$$J(W_i, W_j) = E[L(X)/W_i] - E[L(X)/W_j].$$   [2.9]

Substituting [2.7] and [2.8] into [2.9]

$$J(W_i, W_j) = \tfrac{1}{2}(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j) + \tfrac{1}{2}(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)$$
$$= (\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j).$$   [2.10]


Note that if $\Sigma = I$ (the identity matrix) in [2.10] then
$J(W_i, W_j)$ represents the squared distance between $\mu_i$ and $\mu_j$.
If Bayes' decision rule is used for the classifier, then, for
$P(W_i) = P(W_j) = \tfrac{1}{2}$ (only two classes),
$X \in W_i$ (X belongs to class $W_i$) if the likelihood ratio $\lambda(X) \geq 1$ or $L(X) \geq 0$,
and $X \in W_j$ (X belongs to class $W_j$) if $\lambda(X) < 1$ or $L(X) < 0$.

The probability of misrecognition is

$$e(W_i, W_j) = P(W_j) \cdot P[L(X) \geq 0/W_j] + P(W_i) \cdot P[L(X) < 0/W_i].$$   [2.11]
From [2.6], [2.7] and [2.9], it may be concluded that
$P[L(X)/W_i]$ is a Gaussian density function with mean $\tfrac{1}{2}J(W_i, W_j)$
and variance $J(W_i, W_j)$. Similarly, $P[L(X)/W_j]$ is also a
Gaussian density function with mean $-\tfrac{1}{2}J(W_i, W_j)$ and variance
$J(W_i, W_j)$.

Using [2.11], it can be shown that (24):

$$e(W_i, W_j) = \frac{1}{\sqrt{2\pi}} \int_{\frac{1}{2}\sqrt{J(W_i, W_j)}}^{\infty} \exp\left(-\frac{y^2}{2}\right) dy$$

so that $e(W_i, W_j)$ is a monotonically decreasing function of $J(W_i, W_j)$.
Therefore, the features selected according to the magnitude
of $J(W_i, W_j)$ will provide a corresponding discriminatory
power between $W_i$ and $W_j$.
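
The relation between the divergence and the probability of misrecognition may be checked numerically with a short sketch. It assumes two Gaussian classes with a common covariance matrix and equal a priori probabilities, as in the derivation above; the function names and the sample mean vectors are hypothetical. The error is evaluated as the upper tail of the standard normal distribution beyond half the square root of the divergence.

```python
import math
import numpy as np

def divergence(mu_i, mu_j, sigma):
    """J = (mu_i - mu_j)^T Sigma^{-1} (mu_i - mu_j) for a common covariance Sigma."""
    d = np.asarray(mu_i, dtype=float) - np.asarray(mu_j, dtype=float)
    return float(d @ np.linalg.solve(np.asarray(sigma, dtype=float), d))

def error_probability(j):
    """Equal-prior error: upper tail of the standard normal beyond sqrt(J) / 2."""
    return 0.5 * math.erfc(math.sqrt(j) / (2.0 * math.sqrt(2.0)))

# Hypothetical mean vectors; with Sigma = I the divergence is the squared distance.
mu_a = [0.0, 0.0]
mu_b = [3.0, 4.0]
j_identity = divergence(mu_a, mu_b, np.eye(2))
print(j_identity)

# The error decreases monotonically as the divergence grows.
for j in (0.0, 1.0, 4.0, 16.0):
    print(j, error_probability(j))
```

A feature set that yields a larger divergence between two classes therefore yields a smaller error under these assumptions, which is the basis for ranking features by the magnitude of the divergence.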

Assuming that features are mutually independent in
their effect on the decision, P. M. Lewis in 1962 (25)
proposed that a single number statistical function $G_j$, such
as entropy or average information, may be used to measure
the goodness of feature characterization. The value of the
function $G_j$ is obtained by evaluating the feature $C_j$ over
a large sample of patterns to be recognized. Properties
that are desirable for $G_j$ are summarized as follows:

1) If $G_j > G_k$ then $C_j$ should give a larger percentage
of recognition than $C_k$.

2) If $G_j > G_k$ then $C_j + c$ should have a larger
percentage recognition than $C_k + c$, where c is a
constant.

3) Let $C_s$ be any set of features and let $G_s$ be the
sum of the values of the $G_j$'s of the features in
$C_s$. Let $P_s$ be the percentage of correct
recognition when using the feature set $C_s$. Then
a relation $P_s = A \cdot G_s + B$ should be true where A
and B are constants. In other words, the
percentage of recognition should be a linear
function of the sum of the $G_j$ values.


There is not a single number statistic that can satisfy 


all these desirable features. However, Lewis has proposed 
a single number statistic that can meet most of the
requirements. The function is as follows:

$$G_j = \sum_{i=1}^{N} \sum_{k} P[W_i, C_j(k)] \log \frac{P[C_j(k)/W_i]}{P[C_j(k)]}$$   [2.15]

It may be noted that if $P[C_j(k)/W_i]/P[C_j(k)]$ approaches
unity, that is, if each feature by itself is not very efficient
in the recognition, then $G_j$ may be approximated by the first
term in its power series expansion:

$$G_j \approx \sum_{i=1}^{N} \sum_{k} P[W_i, C_j(k)] \left\{ \frac{P[C_j(k)/W_i]}{P[C_j(k)]} - 1 \right\}$$   [2.16]


The selection of features by use of a single number 
statistic has proved to be quite effective in Lewis's 
experiments. However, two restrictions have to be observed 
in using this goodness measure; the features selected have 


to be those supplied by the designer, and these features


have to be statistically independent. 
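
A single number statistic of this kind may be computed from a table of joint probabilities, as in the following sketch. The table layout (`joint[i, k]` holding the joint probability of class $W_i$ and the kth value of feature $C_j$), the use of natural logarithms, and the example tables are assumptions made for illustration and are not taken from Lewis's experiments.

```python
import numpy as np

def lewis_statistic(joint):
    """Sum over i, k of P[W_i, C(k)] * log( P[C(k)/W_i] / P[C(k)] ).

    The ratio P[C(k)/W_i] / P[C(k)] equals P[W_i, C(k)] / (P[W_i] * P[C(k)]).
    Cells with zero joint probability contribute nothing to the sum.
    """
    joint = np.asarray(joint, dtype=float)
    p_class = joint.sum(axis=1, keepdims=True)   # P[W_i]
    p_value = joint.sum(axis=0, keepdims=True)   # P[C(k)]
    mask = joint > 0
    ratio = joint[mask] / (p_class @ p_value)[mask]
    return float(np.sum(joint[mask] * np.log(ratio)))

# A feature whose values are independent of the class carries no information.
uninformative = [[0.25, 0.25],
                 [0.25, 0.25]]

# A feature whose value determines the class completely.
informative = [[0.5, 0.0],
               [0.0, 0.5]]

print(lewis_statistic(uninformative))
print(lewis_statistic(informative))
```

Under these assumptions the statistic is zero for a feature that is independent of the class and increases as the feature becomes more strongly associated with the class.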


2.3.2. Direct Estimation of Error Probability.


A knowledge of the feature distribution is not always 
available in most pattern recognition problems. Although
the probability density structure of the feature measurements 
can be approximated, it is still rather difficult to obtain 
an exact analytical measure of feature effectiveness which 
directly reflects the recognition accuracy. Based on Parzen 
and Cacoullos's results on the techniques for estimating 


density functions, a nonparametric method for feature 
selection was proposed (26).

The method is based on direct estimation of error 
probabilities from a given set of pattern training samples. 
Let R(Z; h) denote the rectangular parallelepiped in a
p-dimensional measurement space $M_p$ centred at Z; it is defined
by the relation:

$$R(Z; h) = \{X : |x_j - z_j| \leq h_j,\ j = 1, 2, \ldots, p\}$$   [2.17]

where $h_1, h_2, \ldots, h_p$ are positive constants.

Let $\hat{P}(Z)$ denote the estimated density function,
then

$$\hat{P}(Z) = \frac{\{\text{number of samples falling in } R(Z; h)\}}{n \prod_{j=1}^{p} (2h_j)}$$
$$= \frac{1}{n \prod_{j=1}^{p} h_j} \sum_{i=1}^{n} K\left(\frac{z_1 - x_{i1}}{h_1}, \ldots, \frac{z_p - x_{ip}}{h_p}\right)$$   [2.18]

where $x_{ij}$ is the jth component of the ith training sample
and the weighting function K(Y) is defined by

$$K(Y) = \begin{cases} 2^{-p} & \text{if } |y_j| \leq 1,\ j = 1, 2, \ldots, p \\ 0 & \text{otherwise.} \end{cases}$$   [2.19]

The problem of studying the estimates in [2.18] is to
choose suitable $h_j$ and K(Y) for a given number of training
samples n, and to prove the consistency of the estimates
when $n \to \infty$.


Without loss of generality, let $U_j(y)$ be the weighting
function defined by:

$$U_j(y) = \begin{cases} \tfrac{1}{2} & \text{if } |y| \leq 1 \\ 0 & \text{otherwise} \end{cases}$$   [2.20]

and consider the case when the rectangular parallelepiped of
[2.17] is to be a hypercube centred at Z, so that
$h_1 = h_2 = \cdots = h_p = \theta(n)$. The problem is thus reduced to finding the
estimate of the form:

$$\hat{P}(Z) = \frac{1}{n\,\theta^p(n)} \sum_{i=1}^{n} \prod_{j=1}^{p} U_j\left(\frac{z_j - x_{ij}}{\theta(n)}\right)$$   [2.21]

Essentially, [2.21] is used to estimate the density function
by measuring all the distances between an activated point Z
and all the sample points $X_i$ along each coordinate basis in the
p-dimensional continuous measurement space.
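
The hypercube estimate may be sketched in a few lines. The sketch assumes that the estimate is the number of training samples whose every coordinate lies within a distance h of the activated point Z, divided by n times the volume $(2h)^p$ of the hypercube; the function name and the sample points are hypothetical.

```python
import numpy as np

def hypercube_density(z, samples, h):
    """Fraction of samples inside the hypercube of half-width h about z,
    divided by the hypercube volume (2h)^p."""
    samples = np.asarray(samples, dtype=float)
    z = np.asarray(z, dtype=float)
    n, p = samples.shape
    # A sample is counted only if it is within h of z along every coordinate.
    inside = np.all(np.abs(samples - z) <= h, axis=1)
    return float(inside.sum() / (n * (2.0 * h) ** p))

# Hypothetical training samples in a 2-dimensional measurement space.
samples = [[0.0, 0.0],
           [0.5, 0.5],
           [2.0, 2.0],
           [3.0, 3.0]]

density = hypercube_density([0.0, 0.0], samples, h=1.0)
print(density)
```

Two of the four samples fall inside the square of side 2 about the origin, so the estimate is 2 / (4 x 4) = 0.125.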

Now suppose that for pattern class $W_i$ the feature
vectors $X_i$'s are characterized by a fixed, but unknown,
probability distribution over S discrete values of the
measurement space, as expressed in [2.22].

Applying the idea in [2.22] into [2.21], and expressing it in
vector form, allows the estimated density function $\hat{P}(Z)$
to be expressed in discrete form as [2.23].
It may be noted that the parameter $\theta(n)$ is assumed to
be a given constant and there is little freedom to control
it. For a given $\theta(n)$ we can only increase the number n of
training samples so that the estimate in [2.23] is
asymptotically unbiased. Therefore, for a large sample
problem, the conditional probability at the activated point
Z for a given class $W_i$ becomes

$$p(Z/W_i) = \frac{C_i(Z)}{n_i}$$   [2.25]

where $C_i(Z)$ is the sum defined in [2.26].


If the maximum likelihood discriminant rule is used 
for classification, the feature selection criterion is based 
on the direct estimate of the minimized probability of 


misclassification estimated from n training samples as:

$$P_n(\varepsilon) = \sum_{j=1}^{S} \Big\{ \sum_{i=1}^{m} p(Z_j/W_i)\, P(W_i) - \max_{i} \big[ p(Z_j/W_i)\, P(W_i) \big] \Big\}$$  [2.27]


where $p(Z_j/W_i)$ is the conditional probability for a given
class $W_i$ estimated by [2.23] with $n_i$ training samples from
class $W_i$, m is the number of pattern classes, and S is the
total number of discrete points in the N - dimensional
feature space.

A direct estimate of [2.27] is thus used as a feature
selection criterion. The feature set $K_1$ is considered
more effective than the feature set $K_2$ if
$P_n(\varepsilon/K_1) < P_n(\varepsilon/K_2)$.
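To illustrate how such a criterion ranks feature sets, the following Python sketch evaluates the estimated error for two hypothetical feature sets. It assumes the estimate is formed, at each discrete point, as the total joint probability minus its largest class term, summed over all points. The function name `estimated_error`, the two-class setting, and every probability value are illustrative assumptions, not data from this study.

```python
def estimated_error(cond_probs, priors):
    """Estimated probability of misclassification over discrete points.

    cond_probs[i][j] is the estimated p(Z_j / W_i); priors[i] is P(W_i).
    """
    num_classes = len(priors)
    num_points = len(cond_probs[0])
    total = 0.0
    for j in range(num_points):
        joint = [cond_probs[i][j] * priors[i] for i in range(num_classes)]
        # Mass at point j that is NOT captured by the winning class.
        total += sum(joint) - max(joint)
    return total


# Hypothetical estimates for two candidate feature sets, two classes,
# two discrete points each.
priors = [0.5, 0.5]
feature_set_a = [[0.8, 0.2], [0.2, 0.8]]
feature_set_b = [[0.6, 0.4], [0.4, 0.6]]

err_a = estimated_error(feature_set_a, priors)
err_b = estimated_error(feature_set_b, priors)
more_effective = "a" if err_a < err_b else "b"
```

Under these assumed values the first feature set separates the classes more sharply, so it yields the smaller estimated error and would be preferred.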


2.3.3. Feature Space Transformation. 


A linear feature space transformation technique for 
the feature selection problem will now be described. S. 
Watanabe introduced a feature space compression technique 
based on the Karhunen - Loéve (K - L) expansion (27). 

K. S. Fu and Y. T. Chien generalized the transformation 
technique for extracting the effective features (28). 

Let $X^{(i)} = (x_1^{(i)}, x_2^{(i)}, \ldots, x_N^{(i)})$ be a random
variable from a class $W_i$ with zero mean. The K - L expansion
allows the expression of a random process in terms of a set 
of orthonormal vectors. Any component of the sample vector 


x™ can be expressed as a function of the orthonormal set of 


vector Sys Kerli oC oul alnselsene nt such that 
a e w 
Kees LOO Yel [2.28] 


where y,, is the £th component of the kth orthonormal vector 


Y and Kris a coefficient of yy, for the class W;. 
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If the generalized expansion in (2.28) 2ex5sts tore || 
fan) 
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the conditions: 


WOU OG caw nec an N, and the coefficients satisfy 


One aif Karr 


jedit ELy (%,) = [2.29] 
a 0 if k#Z, 


then the component of the ensemble covariance function 


K = |[Cy,l| becomes: 
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The expression in [2.28] whose coordinates 
Ys k= 1, 2, cee. » N} are determined by [2.30] through 
the covariance function K = Were tae will be called the 
generalized K = L expansion. Here c., ee rere cer, NA 
are the eigenvalues of the covariance function K and Yy; 
Le mleeeeee uN ane. thee Lgenvectors:. 

The orthonormal coordinate system produced by the 
K - L expansion minimizes the entropy function over the 
variances of the coordinate coefficients and ensures that 
the discriminatory information over the ensemble of the


probability space is concentrated in a few coordinates by use 


of a linear transformation. 


2.3.4. Stochastic Automata Approach.


Automata with learning behavior in random environments 


can be used for feature selection if the automata are defined 
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in terms of feature subsets, and the environments are 


characterized by training samples and a certain decision 


rule. 
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Fig. 3. Automaton Operating in Random Environment. 


Fig. 3 shows the interaction between an automaton and 
a random environment C which is characterized by 
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The input y of the automaton can only assume two values: 


$$y = \begin{cases} 0 & \text{(no penalty)} \\ 1 & \text{(penalty).} \end{cases}$$


The output f of the automaton is its response, and Q is its
internal state. In an experiment, if the automaton takes
action $f_i$, i = 1, 2, ..., k, then the input y to the
machine becomes:
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Q with probability 1 = P.. 
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desirable to design an automaton with minimum expected 
penalty. In 1969, K. S. Fu and T. J. Li proposed a learning
automaton $A_{r,k}$ which can achieve this purpose (29). The
proposed model, as shown in Fig. 4, has k actions $f_1$, $f_2$,
..., $f_k$, each corresponding to r states (hence the memory
capacity is r). Starting from the first state, every r
consecutive states correspond to one particular action $f_i$.
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It can be shown that the expected penalty for model 


Ay, operating in the random environment C = C(P, 4 Pay sooo. 
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It may be noted that M(Ay, 3 C) is monotonically 
decreasing with respect to the memory capacity r. The
expected behavior is preserved even with r = 1, that is, the 


strategy due to model $A_{r,k}$ is always better than a pure
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random strategy. Furthermore, asymptotic optimality is 
achieved in the sense that $\lim_{r \to \infty} M(A_{r,k}; C) = \min(P_1, P_2,
\ldots, P_k)$. In other words, during the experiment the
probability of applying the best action tends to unity as 
the memory capacity r increases indefinitely. 

Owing to the learning property of $A_{r,k}$, the model can
be employed as a feature selection scheme. Each possible
subset of feature measurements selected is considered to be
an action $f_i$ taken by the automaton. The input of the
automaton is 1, or 0, depending on whether an incorrect, or
correct, classification is made on the basis of the selected
feature subset. If the state transition rule of the $A_{r,k}$
model is used, then the optimal action, or the feature 
subset corresponding to the minimum probability of 
misrecognition, will be selected most frequently. Thus, the 
relative effectiveness of the feature subsets can be measured 


by their relative frequency of occurrence. 
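The selection behaviour described above can be imitated with a small simulation. The sketch below is only an illustration: the state transition rule of the model in (29) is not reproduced in this section, so the code assumes a Tsetlin-type linear tactic (move deeper into the current action on no penalty, move toward the boundary on penalty, and switch to the next action when penalised at the boundary). The penalty probabilities, the function name `run_automaton`, and the fixed random seed are all hypothetical.

```python
import random


def run_automaton(penalty_probs, r, steps, seed=0):
    """Count how often each action (feature subset) is applied.

    penalty_probs[i] plays the role of P_i, the probability that action i
    is penalised (an incorrect classification). r is the number of states
    per action. The transition rule is an ASSUMED linear tactic.
    """
    rng = random.Random(seed)
    k = len(penalty_probs)
    action, depth = 0, 1          # depth 1 is the boundary state
    counts = [0] * k
    for _ in range(steps):
        counts[action] += 1
        penalised = rng.random() < penalty_probs[action]
        if not penalised:
            depth = min(depth + 1, r)      # reward: move deeper
        elif depth > 1:
            depth -= 1                     # penalty: move toward boundary
        else:
            action = (action + 1) % k      # penalty at boundary: switch
            depth = 1
    return counts


# Three hypothetical feature subsets; the second has the lowest
# misrecognition probability.
counts = run_automaton([0.6, 0.1, 0.7], r=4, steps=5000)
best = counts.index(max(counts))
```

The relative frequencies in `counts` then serve as the measure of relative effectiveness of the subsets, with the most frequently applied action taken as the selected feature subset.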


2.3.5. Comparisons of Different Methods for Feature
Selection.

When the features from each class are distributed 
according to Gaussian probability density functions with 
unequal covariance matrices, we can use the minimax linear 
discriminant functions to derive a separability measure for 
the problem of feature selection in parametric multiclass 
pattern recognition. It appears that in general, when the 
covariance matrices are unequal and the overall 


misrecognition is used as a measure of feature effectiveness,
the separability measure based on minimax linear discriminants
could be a useful criterion for feature selection in multiclass
pattern classification.

Direct estimation of error probability is a 
nonparametric feature selection technique. Since no 
assumptions are made for the probability structure of the 
feature distributions and the independence of measurements, 
the proposed nonparametric technique is more effective 
than the parametric feature selection technique when little 
a priori knowledge of feature distribution is available for
each class. Moreover, when the number of training samples 
is small, the nonparametric method based on density 
approximation can produce comparatively good results. The 
computation time required for the nonparametric method of 
feature selection is, in general, more than that of the 
parametric method. 

The feature selection technique based on the 
generalized Karhunen - Loéve expansion is superior due to
the fact that the transformation procedure is less sensitive 
with respect to the probability structure of each pattern 
class, which, in practice, may only be estimated from a 
limited number of training samples. However, the optimality 
of the technique is defined over the ensemble of classes, 
and it makes no explicit provision for the discriminatory 
analysis between classes. For all $P \le N$, where N is the
total number of features extracted, the transformed P - 
dimensional feature space is less effective than the same 


dimensional feature subspace selected by the parametric 
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feature selection technique based on linear discriminant 
functions and the nonparametric feature selection technique 
based on direct error estimations. When the class 
distributions are unknown, the feature space transformation 
technique is superior to the parametric feature selection 
technique in multiclass pattern recognition. Considering 
the computation time required, the feature space 
transformation technique has some advantages over the 
proposed nonparametric feature selection technique based on 
direct error estimations. However, to obtain an overall 
optimum performance of the pattern recognition system, the 
nonparametric feature selection technique is
preferred. 

The main advantage of the stochastic automata model 
for feature selection is its simplicity in implementation. 
Furthermore, the approach is also nonparametric in nature 
and able to match the decision rule used by the classifier. 
The basic concept is to reformulate the feature selection 


problem as a decision problem with adaptive behavior. 


2.4. Pattern Classification Stage.


Pattern classification involves the determination of 
an optimal decision procedure in the identification and 
classification process. A classifier can be regarded as a 
black box with n input lines and a single output line as shown 
in Fig. 5. The feature measurements of a pattern constitute 
the n inputs and the prediction constitutes the output or 


response. The machine attempts to decide to which pattern 
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class the observed data belong. Since the pattern classes 
can be represented by n disjoint regions in the feature 
space, the classifier generates decision boundaries in the 
form of discriminant functions to separate the n pattern 
classes. The discriminant functions are scalar and single-valued
functions estimated from the observed sample feature
measurement vectors such that if $G_i(X)$ has its largest value
for a pattern X then X is assigned to class $W_i$. The
percentage of correct recognitions depends on the effective 
utilization of the available discriminatory information. 
Based on the correlation between them, the patterns that 
possess the same mathematical or statistical features are 
grouped in the same class. The decision functions can be 
generated in a variety of ways. When complete a priori
knowledge about the patterns to be recognized is available,
the decision functions can be determined with precision.
When only qualitative knowledge about the patterns is available, 
reasonable guesses of the forms of the decision functions can 
be made. In this instance the decision boundaries may be far 
from correct and it may be necessary to design the machine 
to achieve the desirable performance through a sequence of 
adjustments. Machines for recognizing such patterns are 
best designed by use of a training procedure. Two training 
methods are proposed for adjusting the discriminant 
functions, namely, 

1) Parametric Training Method, 


and 2) Nonparametric Training Method. 
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2.4.1. Parametric Training Method. (30) 


Pattern classification can be treated as a statistical 
decision problem. Assuming that the feature measurements of 
the patterns are normally distributed random variables, the 
joint probability density of n components of the feature 
measurement vector is then characterized by a multivariate 
normal distribution. For the ith pattern class under
consideration, the normal probability density function is
completely specified by the mean vector $\mu_i$ and covariance
matrix $\Sigma_i$. Sample patterns can be used to estimate these
two parameters for each class, and the formation of the
discriminant functions is based on these estimated values.

It is found that the patterns in n classes are governed
by n distinct probability functions $P(X/W_i)$, i = 1, 2, ...,
n. The function $P(X/W_i)$ is defined as the probability
of occurrence of pattern X given that it belongs to class $W_i$.
There is another probability $P(W_i)$ which is the a priori
probability of occurrence of class $W_i$. The discriminant
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function for class W; can be expressed in terms of P(X/W; ) 

and $P(W_i)$, both of which may be estimated from the pattern
samples. There exists a loss function $\lambda(W_i/W_j)$ which is
defined as the loss incurred when a pattern belonging to class
j is misplaced by the system into class i. When $\lambda(W_i/W_j)$ is
minimized in some sense for all i the system is said to be
optimal.


The conditional average loss when X occurs is defined as

$$L_X(W_i) = \sum_{j=1}^{n} \lambda(W_i/W_j)\, P(W_j/X)$$  [2.34]

where $P(W_j/X)$ is the probability that a given pattern X
belongs to class $W_j$. The $L_X(W_i)$ may be calculated for any
specific X for all possible values of i, i = 1, 2, ..., n.
If $L_X(W_i) \le L_X(W_j)$ for all j,
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then the system would place X into class W;. 


By Bayes' rule, the a posteriori probability
$P(W_j/X)$ is given by:

$$P(W_j/X) = \frac{P(X/W_j)\, P(W_j)}{P(X)}$$  [2.36]

where $P(X/W_j)$ is the likelihood of $W_j$ with respect to X,

$P(W_j)$ is the probability of occurrence of class $W_j$
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and P(X) is the probability of occurrence of pattern X. 
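Substituting [2.36] into [2.34] gives the conditional
average loss:

$$L_X(W_i) = \sum_{j=1}^{n} \lambda(W_i/W_j)\, P(W_j/X)
= \sum_{j=1}^{n} \lambda(W_i/W_j)\, \frac{P(X/W_j)\, P(W_j)}{P(X)}
= \frac{1}{P(X)} \sum_{j=1}^{n} \lambda(W_i/W_j)\, P(X/W_j)\, P(W_j).$$  [2.37]

It may be noted that P(X) is independent of i, and therefore
the factor 1/P(X) does not affect the comparison between
classes. The conditional average loss may thus be redefined as

$$L_X(W_i) \triangleq \sum_{j=1}^{n} \lambda(W_i/W_j)\, P(X/W_j)\, P(W_j).$$  [2.38]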


If we assume that an error in classification is
equivalent to a unit loss, we can define $\lambda(W_i/W_j)$ to be a
special loss function known as the (0 - 1) loss function or
symmetrical loss function such that

$$\lambda(W_i/W_j) = 1 - \delta_{ij}$$  [2.39]

where $\delta_{ij}$ is a Kronecker delta function such that

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{if } i \neq j. \end{cases}$$  [2.40]


Substituting [2.39] into [2.38], we have

$$L_X(W_i) = \sum_{j=1}^{n} (1 - \delta_{ij})\, P(X/W_j)\, P(W_j)
= P(X) - P(X/W_i) \cdot P(W_i).$$  [2.41]
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It is obvious that we can minimize the conditional 
average loss by maximizing $P(X/W_i) \cdot P(W_i)$. Therefore, the
maximum likelihood decision can be defined as

$$G_i(X) = P(X/W_i) \cdot P(W_i).$$  [2.42]

Equivalently,

$$G_i(X) \triangleq \log_e [P(X/W_i) \cdot P(W_i)]$$

or

$$G_i(X) = \log_e P(X/W_i) + \log_e P(W_i).$$  [2.43]
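The equivalence between minimizing the conditional average loss under the (0 - 1) loss function and maximizing the product of likelihood and a priori probability can be checked numerically. The Python sketch below is a minimal illustration under assumed values: the three likelihoods and priors are hypothetical, and the names `conditional_loss` and `decide` are introduced here only for the example.

```python
def conditional_loss(i, likelihoods, priors, loss):
    """Unnormalised conditional average loss for deciding class i:
    sum over j of loss[i][j] * P(X/W_j) * P(W_j)."""
    return sum(loss[i][j] * likelihoods[j] * priors[j]
               for j in range(len(priors)))


def decide(likelihoods, priors, loss):
    """Return the class index with the smallest conditional average loss."""
    n = len(priors)
    return min(range(n),
               key=lambda i: conditional_loss(i, likelihoods, priors, loss))


# Hypothetical values of P(X/W_j) and P(W_j) for one pattern X.
likelihoods = [0.05, 0.20, 0.10]
priors = [0.5, 0.2, 0.3]
n = len(priors)

# (0 - 1) loss: zero on the diagonal, one elsewhere.
zero_one = [[0 if i == j else 1 for j in range(n)] for i in range(n)]

by_loss = decide(likelihoods, priors, zero_one)
by_product = max(range(n), key=lambda i: likelihoods[i] * priors[i])
```

Both rules pick the same class here, which is what the derivation predicts for the symmetrical loss function; with an asymmetric loss matrix the two choices need not agree.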


The function $P(W_i)$ is readily computed by counting the
occurrence of class $W_i$ elements in the sample training
patterns. The conditional probability $P(X/W_i)$ is obtained
by making the assumption that the pattern components are 
normally distributed random variables. This assumption is 
not always met; however, the results are often found to be 
superior to those dependent on other assumptions. 

Based on normalized distance, the multivariate normal 
pattern is taken to have an n - variate normal probability 


distribution

$$P(X/W_i) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_i|^{1/2}} \exp\{ -\tfrac{1}{2} (X - \mu_i)'\, \Sigma_i^{-1} (X - \mu_i) \}$$  [2.44]


where the pattern X is an n x 1 column vector of the form:
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and 4, is the mean feature vector estimated by 


$$\mu_i = \frac{1}{N} \sum_{k=1}^{N} X_k$$
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covariance matrix where 
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Yas is the inverse of 2, 
$|\Sigma_i|$ is the determinant of $\Sigma_i$,
and N is the number of sample feature vectors. 


Substituting [2.44] into the expression for optimal 


classifier for normal pattern, we have 


$$G_i(X) = \log_e P(X/W_i) + \log_e P(W_i)
= \log_e P(W_i) - \frac{n}{2} \log_e 2\pi - \frac{1}{2} \log_e |\Sigma_i|
- \frac{1}{2} [(X - \mu_i)'\, \Sigma_i^{-1} (X - \mu_i)].$$  [2.45]


It may be noted that $\log_e P(W_i) - \frac{1}{2} \log_e |\Sigma_i|$ does not
depend on the particular pattern being classified; it may be
treated as a constant and denoted by $b_i$. Redefining $G_i(X)$ to
exclude the common term $(-\frac{n}{2} \log_e 2\pi)$ for all i, we have

$$G_i(X) \triangleq b_i - \frac{1}{2} (X - \mu_i)'\, \Sigma_i^{-1} (X - \mu_i).$$  [2.46]
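A discriminant of this kind, a class constant minus half the normalized squared distance to the class mean, can be sketched in a few lines of Python. The example below is restricted to two features so that the matrix inverse can be written out by hand. It assumes the covariance matrix is estimated by averaging outer products of deviations over the N samples (the estimator itself is not legible in this section), and all sample values, priors, and function names are hypothetical.

```python
import math


def mean_vector(samples):
    """Sample mean of a list of 2-component feature vectors."""
    n = len(samples)
    return [sum(s[d] for s in samples) / n for d in range(2)]


def covariance_2x2(samples, mu):
    """Covariance estimate using a 1/N average (an assumed estimator)."""
    n = len(samples)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for s in samples:
        dev = [s[0] - mu[0], s[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                c[a][b] += dev[a] * dev[b] / n
    return c


def discriminant(x, mu, cov, prior):
    """b_i - 0.5 * (x - mu)' inv(cov) (x - mu), b_i = ln P(W_i) - 0.5 ln|cov|."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    dev = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(dev[a] * inv[a][b] * dev[b]
               for a in range(2) for b in range(2))
    b_i = math.log(prior) - 0.5 * math.log(det)
    return b_i - 0.5 * quad


# Hypothetical training samples for two pattern classes.
class_samples = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)],
    [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)],
]
priors = [0.5, 0.5]
params = []
for samples in class_samples:
    mu = mean_vector(samples)
    params.append((mu, covariance_2x2(samples, mu)))


def classify(x):
    """Assign x to the class whose discriminant value is largest."""
    scores = [discriminant(x, mu, cov, p)
              for (mu, cov), p in zip(params, priors)]
    return scores.index(max(scores))
```

With these assumed samples, a pattern near the origin is assigned to the first class and a pattern near (4.5, 4.5) to the second. A singular covariance estimate (zero determinant) would need separate handling, which this sketch omits.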
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2.4.2. Nonvarametric Training Method. (30) 

When no assumptions can be made about the characterizing
parameters of the individual classes, a nonparametric training method
may be used in pattern recognition. The method is a
distribution-free adaptive procedure designed to find the
most appropriate weight vectors for the discriminant functions.
When applying a nonparametric training method, functional 
forms have to be assumed for the discriminant functions. 

Three possible forms for the discriminant functions are: 

1) Linear,

2) Quadratic,

and 3) Piecewise Linear.

These functions all contain unspecified coefficients
which constitute the weight vectors. For example, in the
linear case, the linear discriminant function is defined as
an equation having the form $G_i(X) = W_1 X_1 + W_2 X_2 + \cdots
+ W_n X_n + W_{n+1}$, with $(W_1, W_2, \ldots, W_n, W_{n+1})$ being the
weight vector. A typical linear classifier using linear
discriminant functions is shown in Fig. 6. In practice,
the proper values for the weights are unknown; initial
estimates have to be made and the machine is designed to
adjust the weight vectors of the discriminant functions
until the machine performs adequately on the training sets.
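One common way to carry out such an adjustment is a fixed-increment error-correction rule: whenever a training pattern is misclassified, the augmented pattern is added to the weight vector of the correct class and subtracted from that of the wrongly chosen class. The Python sketch below assumes this rule, which is not necessarily the procedure used in this study. The training patterns, the zero initial weights, and the function names are hypothetical.

```python
def g(weights, x):
    """Linear discriminant W_1 X_1 + ... + W_n X_n + W_{n+1}."""
    return sum(w * xi for w, xi in zip(weights, x)) + weights[-1]


def classify(weight_vectors, x):
    """Assign x to the class with the largest discriminant value."""
    scores = [g(w, x) for w in weight_vectors]
    return scores.index(max(scores))


def train(samples, labels, num_classes, max_epochs=100):
    """Fixed-increment error correction (an assumed training rule)."""
    n = len(samples[0])
    weights = [[0.0] * (n + 1) for _ in range(num_classes)]
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            pred = classify(weights, x)
            if pred != y:
                errors += 1
                augmented = list(x) + [1.0]
                for d in range(n + 1):
                    weights[y][d] += augmented[d]      # reinforce true class
                    weights[pred][d] -= augmented[d]   # weaken wrong class
        if errors == 0:        # performs adequately on the training set
            break
    return weights


# Hypothetical, linearly separable training set with two classes.
samples = [(0, 0), (0, 1), (1, 0), (3, 3), (3, 4), (4, 3)]
labels = [0, 0, 0, 1, 1, 1]
weights = train(samples, labels, num_classes=2)
training_ok = all(classify(weights, x) == y for x, y in zip(samples, labels))
```

The loop stops as soon as a full pass over the training set produces no errors. For training sets that are not linearly separable the rule need not settle, which is why the number of passes is capped.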

Nonparametric training is applicable to a wide variety 
of distributions; we can control the complexity of the 
classifier by prior specification. However an optimum 


performance on training set does not guarantee a similar 




performance on other data. The performance of the classifier
generally improves with the number of sample patterns
being observed; as the number of samples grows, the
recognition rate should approach the best attainable with
the assumed form of discriminant function.
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CHAPTER III 
FUZZY LOGIC


3.1. General.


In information theory we usually treat the uncertainty 


and imprecision of a problem through the concepts and methods 


of probability theory. However, in most real situations in
information retrieval the source of imprecision is the 
absence of sharply defined criteria of class membership 
rather than the presence of random variables. Many classes 
are fuzzy rather than precise in nature. Objects need not 
necessarily either belong, or not belong, to a class; there 
may be intermediate grades of membership. To describe the 
degree with which an object belongs to a class, L. A. Zadeh 
proposed a multivalued logic with a possibly continuous 
infinity of truth values (31, 32, 33, 34). Zadeh's logic is
based on the idea of a fuzzy set.

A fuzzy set is a class with an unsharp boundary. It 
is an imprecisely defined class in which the transition from 
membership to non-membership is gradual rather than abrupt. 
There may be grades of membership intermediate between full
membership and non-membership. 

Let X be a collection of objects with each individual 
element denoted by x, that is, X = {x}. Then a fuzzy set A
in X can be characterized by a membership function $\mu_A(x)$
which assigns to each element x in X a number in the closed
interval between zero and one. The value of $\mu_A(x)$ indicates
the grade of membership of x in A. Thus, the nearer the
value of $\mu_A(x)$ to unity, the higher will be the grade of
membership of x in A.

The concept of fuzzy sets has proved to be relevant to 
a wide variety of problems related to information processing, 
information control, pattern recognition, system identification, 
artificial intelligence, and many other types of decision 
processes that involve incomplete or uncertain data. The 
idea of fuzzy set has also found application in feature 
extraction, which is the first stage of the proposed automatic 


classification system. 


3.2. Basic Definitions of Fuzzy Sets.


Prior to using fuzzy logic in application to practical
problems it is necessary to construct a mathematical 
framework for manipulation of fuzzy sets and the study of 
their properties. We shall begin the discussion of fuzzy 
sets with a number of basic definitions. 

1) Two fuzzy sets A and B are said to be equal,
   written as A = B, if and only if $\mu_A(x) = \mu_B(x)$
   for all x in X.

2) A fuzzy set A is empty if and only if its
   membership function is identically zero for all
   x, that is, $\mu_A(x) = 0$ for all x in X.

3) The complement of a fuzzy set A is a fuzzy set A'
   whose membership function is given by
   $\mu_{A'}(x) = 1 - \mu_A(x)$.

4) A fuzzy set A is contained in fuzzy set B, or
   equivalently, A is a subset of B, or A is smaller
   than or equal to B, written as $A \subset B$, if and only
   if $\mu_A(x) \le \mu_B(x)$ for all x in X.

5) The union of two fuzzy sets A and B is denoted by
   $A \cup B$ and is defined as the smallest fuzzy set
   that contains both A and B. The membership function
   of $A \cup B$ is expressed by
   $\mu_{A \cup B}(x) = \max[\mu_A(x), \mu_B(x)]$,
   or equivalently, $\mu_{A \cup B}(x) = \mu_A(x) \vee \mu_B(x)$.

6) Similarly, the intersection of two fuzzy sets A
   and B is denoted by $A \cap B$ and is defined as the
   largest fuzzy set contained in both A and B. The
   membership function of $A \cap B$ is expressed by
   $\mu_{A \cap B}(x) = \min[\mu_A(x), \mu_B(x)]$, or equivalently,
   $\mu_{A \cap B}(x) = \mu_A(x) \wedge \mu_B(x)$.
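For a finite collection X, the definitions above translate directly into operations on tables of membership grades. The following Python sketch represents a fuzzy set as a dictionary from element to grade; this representation, the function names, and the grades chosen for A and B are assumptions made for illustration only.

```python
def complement(a):
    """Grades of A': one minus the grade in A."""
    return {x: 1.0 - grade for x, grade in a.items()}


def union(a, b):
    """Smallest fuzzy set containing both A and B (pointwise max)."""
    return {x: max(a[x], b[x]) for x in a}


def intersection(a, b):
    """Largest fuzzy set contained in both A and B (pointwise min)."""
    return {x: min(a[x], b[x]) for x in a}


def is_subset(a, b):
    """A is contained in B when no grade in A exceeds that in B."""
    return all(a[x] <= b[x] for x in a)


def is_empty(a):
    """A is empty when every grade is zero."""
    return all(grade == 0 for grade in a.values())


# Hypothetical fuzzy sets over the same three elements.
A = {"x1": 0.2, "x2": 0.7, "x3": 1.0}
B = {"x1": 0.5, "x2": 0.4, "x3": 1.0}

a_or_b = union(A, B)
a_and_b = intersection(A, B)
```

By construction the intersection is contained in each of A and B, and each of them is contained in the union, which can be confirmed with `is_subset`.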

The intersection and union of two fuzzy sets A and B 


can be illustrated graphically as in Fig. 7.

Fig. 7. Diagram Illustrating the Union and Intersection

The membership function of the union of fuzzy sets A and B
is comprised of curve segments 1 and 2; that of the
intersection is comprised of curve segments 3 and 4.

The notions of union, intersection and complementation 
play important roles in fuzzy logic; they can be extended 
easily to many basic identities for fuzzy sets. Examples of 


a few are as follows: 


$(A \cup B) \cup C = A \cup (B \cup C)$
                                   Associative Law,
$(A \cap B) \cap C = A \cap (B \cap C)$

$(A \cup B)' = A' \cap B'$
                                   De Morgan's Law,
$(A \cap B)' = A' \cup B'$

$C \cap (A \cup B) = (C \cap A) \cup (C \cap B)$
                                   Distributive Law.
$C \cup (A \cap B) = (C \cup A) \cap (C \cup B)$


These, and many other similar equalities, can be readily
established by showing that the corresponding relations for
the membership functions of A, B and C are identities. For
example,

$$(A \cup B)' = A' \cap B'$$

is equivalent to

$$1 - \max[\mu_A(x), \mu_B(x)] = \min[1 - \mu_A(x), 1 - \mu_B(x)],$$

and the distributive law $C \cup (A \cap B) = (C \cup A) \cap (C \cup B)$
is equivalent to

$$\max[\mu_C(x), \min\{\mu_A(x), \mu_B(x)\}] = \min[\max\{\mu_C(x), \mu_A(x)\}, \max\{\mu_C(x), \mu_B(x)\}].$$
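Identities of this kind depend only on the grades of membership at a single point x, so they can be spot-checked by trying every combination of grades on a grid. The Python sketch below does this for De Morgan's law and for one distributive law. The grid of eleven grades is an arbitrary choice, and a check of this sort illustrates the identities without proving them.

```python
from itertools import product

# Candidate grades of membership 0.0, 0.1, ..., 1.0.
grades = [i / 10 for i in range(11)]

# (A u B)' = A' n B'  <=>  1 - max(a, b) == min(1 - a, 1 - b)
de_morgan_holds = all(
    1 - max(a, b) == min(1 - a, 1 - b)
    for a, b in product(grades, repeat=2)
)

# C u (A n B) = (C u A) n (C u B)
#   <=>  max(c, min(a, b)) == min(max(c, a), max(c, b))
distributive_holds = all(
    max(c, min(a, b)) == min(max(c, a), max(c, b))
    for a, b, c in product(grades, repeat=3)
)
```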


One can interpret the intersection and union of 
fuzzy sets in terms of dropping a ball bearing through a 
network of pipes. Let B be a fuzzy set which is expressed
in terms of a family of fuzzy sets $A_1, A_2, \ldots, A_n$
through the connections $\cup$ and $\cap$, and let $\mu_i(x)$ be the
membership function of fuzzy set $A_i$, i = 1, 2, ..., n,
with respect to x. Analogously, let P be a pipe whose
passage clearance for a ball bearing x through it can be
simulated by the resultant passage clearance for the same
ball bearing x through a network of pipes $Q_1, Q_2, \ldots, Q_n$
in series and parallel connections. If $S_i(x)$ is the passage
clearance for ball bearing x through pipe $Q_i$, i = 1, 2,
..., n, then $S_i(x) \vee S_j(x)$ and $S_i(x) \wedge S_j(x)$ correspond to
parallel (or operation) and series (and operation) connections
of $S_i(x)$ and $S_j(x)$ respectively, as shown in Fig. 8.


(Fig. 8 panels: parallel connection; series connection)


Fig. 8. Parallel and Series Connection of Pipes Simulating 


Union and Intersection of Two Fuzzy Sets. 
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Therefore, an expression TM Vol ing? Axe ARTs Le Mahe ustand 
$\cap$ corresponds to a network of pipes $Q_1, Q_2, \ldots, Q_n$ which
can be formed by the conventional synthesis techniques
employed in switching theory. As an example,

C= (CA Roe AG aue Ae uel Ane a, ) 0 Ad 


can be denoted by the network of pipes as shown in Fig. 9.


Ss 


Fig. 9. A Network of Pipes Simulating


Lax) & And} ve Aso) TYalx) v Aeslxd}  Acg(od]. 


It may be noted that the passage of a ball bearing x 
through a pipe also depends on the diameter of the ball 
bearing. The whole network itself is equivalent to a single 


pipe whose passage clearance is of size $S_P(x)$.


3.3. Algebraic Operations on Fuzzy Sets.


Besides the operations of intersection and union, 
there are other ways of forming combinations of fuzzy sets 


and of relating them to one another. 


1) The algebraic product of fuzzy sets A and B is
   written as A · B and can be defined in terms of the
   membership functions of A and B by the relation
   $\mu_{A \cdot B}(x) = \mu_A(x) \cdot \mu_B(x)$.

2) The algebraic sum of fuzzy sets A and B is denoted
   by A + B and is defined by
   $\mu_{A+B}(x) = \mu_A(x) + \mu_B(x)$
   provided that the sum $\mu_A(x) + \mu_B(x)$ is less than
   or equal to unity for all x in X.

3) The absolute difference of fuzzy sets A and B is
   written as |A - B| and is defined by
   $\mu_{|A-B|}(x) = |\mu_A(x) - \mu_B(x)|$.

4) Let A and B be two arbitrary fuzzy sets. The
   convex combination of A, B and a third fuzzy set $\Lambda$
   is denoted by $(A, B; \Lambda)$ and can be defined as the
   linear combination of A and B in the form
   $(A, B; \Lambda) = \Lambda A + \Lambda' B$
   where $\Lambda'$ is the complement of $\Lambda$. Expressing the
   relationship in terms of the membership functions,
   we have
   $\mu_{(A,B;\Lambda)}(x) = \mu_\Lambda(x)\, \mu_A(x) + [1 - \mu_\Lambda(x)]\, \mu_B(x)$.

A basic property of the convex combination of fuzzy 
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‘ This property is an immediate consequence of the 


inequalities: 


$$\min[\mu_A(x), \mu_B(x)] \le \lambda \mu_A(x) + (1 - \lambda) \mu_B(x)
\le \max[\mu_A(x), \mu_B(x)], \quad \text{where } 0 \le \lambda \le 1.$$
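The algebraic operations, and the bound on the convex combination in particular, can be tried out on small finite fuzzy sets. In the Python sketch below a fuzzy set is again represented as a dictionary of grades; the representation, the function names, and the grades of A, B and the weighting set `LAM` are hypothetical. The algebraic sum raises an error when a summed grade would exceed unity, reflecting the proviso attached to its definition.

```python
def algebraic_product(a, b):
    """Pointwise product of membership grades."""
    return {x: a[x] * b[x] for x in a}


def algebraic_sum(a, b):
    """Pointwise sum, defined only if no grade exceeds unity."""
    total = {x: a[x] + b[x] for x in a}
    if any(grade > 1.0 for grade in total.values()):
        raise ValueError("algebraic sum undefined: a grade exceeds unity")
    return total


def absolute_difference(a, b):
    """Pointwise absolute difference of membership grades."""
    return {x: abs(a[x] - b[x]) for x in a}


def convex_combination(a, b, lam):
    """Grades of (A, B; Lambda): lam * a + (1 - lam) * b at each point."""
    return {x: lam[x] * a[x] + (1.0 - lam[x]) * b[x] for x in a}


# Hypothetical fuzzy sets and weighting set.
A = {"x1": 0.2, "x2": 0.7}
B = {"x1": 0.5, "x2": 0.1}
LAM = {"x1": 0.25, "x2": 0.5}

combo = convex_combination(A, B, LAM)

# The combination lies between the intersection and the union
# (a small tolerance absorbs floating-point rounding).
bounded = all(
    min(A[x], B[x]) - 1e-12 <= combo[x] <= max(A[x], B[x]) + 1e-12
    for x in A
)
```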
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3.4, Fuzzy Relation. 

The concept of relation has a natural extension to fuzzy 
sets. It plays an important role in the theory of fuzzy sets 
and their applications. The term "relation" can be defined 
as a set of ordered pairs (34). For example, the set of all
ordered pairs of positive integers x and y such that x = y? 
can be regarded as a relation between x and y. In terms of 
fuzzy sets, if X = {x} and Y = {y}, then a fuzzy relation R
between X and Y is a fuzzy set F in the product space X x Y
such that F is characterized by a membership function
$\mu_F(x, y)$ which associates with each pair (x, y) its grade of
membership in F. For example, the fuzzy relation $x \ll y$,
where x and y are both in space S, may be regarded as a
fuzzy set B in S x S, or $S^2$, such that the membership function
of B, $\mu_B(x, y)$, may take on the following representative


values: 

ceo enliO) = 104 

Aol 10, LOO} g47086% 

JAN 1000) #=81, etc. 

The values of the membership function $\mu_B(x, y)$ lie within
the closed interval [0, 1] and it is referred to as the 
strength of the relation between x and y. 

More generally, one can interpret an n - order fuzzy 
relation of fuzzy set X in space Y as a fuzzy set A in the 
product space $Y^n$ with the membership function in the form
$\mu_A(x_1, x_2, \ldots, x_i, \ldots, x_n)$ where $x_i$ is a member of X,
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basic definitions that are related to fuzzy relations: 

1) The domain of a fuzzy relation A is denoted by
   dom A and is a fuzzy set defined by
   $\mu_{\mathrm{dom}\,A}(x) = \bigvee_y \mu_A(x, y)$, $x \in X$,
   where the supremum $\bigvee_y$ is taken over all y in Y.

2) The range of a fuzzy relation A is denoted by
   ran A and is a fuzzy set defined by
   $\mu_{\mathrm{ran}\,A}(y) = \bigvee_x \mu_A(x, y)$, $x \in X$, $y \in Y$.

3) The height of a fuzzy relation A is denoted by h(A) and
   is defined by
   $h(A) = \bigvee_x \bigvee_y \mu_A(x, y)$.
   A fuzzy relation A is said to be subnormal if
   h(A) < 1 and normal if h(A) = 1.

4) A fuzzy relation A is said to be contained in fuzzy
   relation B if
   $\mu_A(x, y) \le \mu_B(x, y)$ for all pairs (x, y) in the
   product space X x Y.
   The containment of A in B is expressed in the
   form $A \subset B$.

5) The union of fuzzy relations A and B is denoted by
   A + B and is defined by
   $\mu_{A+B}(x, y) = \max[\mu_A(x, y), \mu_B(x, y)]$
   where $x \in X$ and $y \in Y$. Union may also be expressed
   by writing
   $\mu_{A+B}(x, y) = \mu_A(x, y) \vee \mu_B(x, y)$.

6) The intersection of two fuzzy relations A and B is
   denoted by $A \cap B$ and is defined by
   $\mu_{A \cap B}(x, y) = \min[\mu_A(x, y), \mu_B(x, y)]$
   where $x \in X$ and $y \in Y$. It may also be expressed
   in the form
   $\mu_{A \cap B}(x, y) = \mu_A(x, y) \wedge \mu_B(x, y)$.

7) The product of two fuzzy relations A and B is
   denoted by A · B and is defined by
   $\mu_{A \cdot B}(x, y) = \mu_A(x, y) \cdot \mu_B(x, y)$.
   It may be noted that if A, B and C are any fuzzy
   relations from X to Y, then the identity
   C(A + B) = CA + CB
   will hold.

8) The complement of a fuzzy relation A is denoted by
   A' and is defined by
   $\mu_{A'}(x, y) = 1 - \mu_A(x, y)$.

9) The composition, or more specifically, the max - min
   composition, of two fuzzy relations A and B is
   denoted by $B \circ A$ and is defined as a fuzzy relation
   whose membership function is related to those of
   A and B by
   $\mu_{B \circ A}(x, y) = \max_v \min[\mu_A(x, v), \mu_B(v, y)]$,
   or equivalently,
   $\mu_{B \circ A}(x, y) = \bigvee_v [\mu_A(x, v) \wedge \mu_B(v, y)]$.
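When the underlying sets are finite, a fuzzy relation can be stored as a matrix of grades and the max - min composition becomes a matrix operation in which multiplication is replaced by min and addition by max. The Python sketch below shows this for two small relations; the nested-list representation, the function name, and the grades are assumptions chosen for the example.

```python
def max_min_composition(a, b):
    """Max - min composition of relation matrices.

    a[x][v] holds the grade of (x, v) in A and b[v][y] the grade of
    (v, y) in B. Entry (x, y) of the result is the largest, over all
    intermediate elements v, of min(a[x][v], b[v][y]).
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [max(min(a[x][v], b[v][y]) for v in range(inner))
         for y in range(cols)]
        for x in range(rows)
    ]


# Hypothetical 2 x 2 relation matrices.
A = [[0.3, 0.8],
     [0.6, 0.1]]
B = [[0.5, 0.9],
     [0.4, 0.2]]

composed = max_min_composition(A, B)
```

For instance, the first entry is max(min(0.3, 0.5), min(0.8, 0.4)) = 0.4: the strongest connection from the first element passes through the second intermediate element.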


In the following section the max - min composition is 


used to define a similarity relation. 


3.5. Similarity Relation.


It may be noted that the operation of max - min 
composition can be applied to a single fuzzy relation A to 
find the similarity relation of elements in a fuzzy set A.
The similarity relation S between elements x and y in fuzzy
set A may be defined in terms of membership function as
follows:

$$\mu_S(x, y) = \max_v \min[\mu_A(x, v), \mu_A(v, y)]$$

where x, y and v are all contained in A, and v ranges
through all its n possible values where n is the number of
elements in fuzzy set A.

For each of the n values of v, the minimum of the pair
$\mu_A(x, v)$ and $\mu_A(v, y)$ is taken, and $\mu_S(x, y)$ is the
maximum of these n minima.

The n - order composition $A \circ A \circ \cdots \circ A$ is denoted
by $A^n$. If A is a finite set, $A^n$ may be represented by a
relation matrix whose (x, y)th element takes on the value of
$\mu_{A^n}(x, y)$. The similarity relation matrix for a fuzzy set A
is given by the max - min composition of the elements in a
relation matrix A.

The concept of a similarity relation is a generalization 
of the concept of equivalence. Zadeh found that it is 
possible to adapt the well-developed theory of relations to 
situations which involve classes that do not have sharply 
defined boundaries. A similarity relation S is a fuzzy 
relation that is reflexive, symmetric, and transitive. 

Let x, y be elements of a fuzzy set A, and let $\mu_S(x, y)$
denote the grade of membership of the ordered pair (x, y)
in S. Then S is a similarity relation in A if and only if,
for all x, y and v in A,

1) $\mu_S(x, x) = 1$ for all x in dom S (reflexive).
2) $\mu_S(x, y) = \mu_S(y, x)$ for all x and y in dom S
   (symmetry). [3.2]
3) $S \supset S \circ S$ (max - min transitivity) or more
   specifically
   $\mu_S(x, y) \ge \bigvee_v [\mu_S(x, v) \wedge \mu_S(v, y)]$ [3.3]
   where $\vee$ and $\wedge$ denote max and min respectively.
An example of a fuzzy relation matrix having similarity
relations is shown in Fig. 10.

    1    0.3  0.8  0.5  0.3
    0.3  1    0.6  0.2  0.5
    0.8  0.6  1    0.1  0.4
    0.5  0.2  0.1  1    0.2
    0.3  0.5  0.4  0.2  1


Fig. 10. Relation Matrix Having Similarity Relations. 


Let $x_{i_1}, \ldots, x_{i_k}$ be k points in X such that
$\mu_S(x_{i_1}, x_{i_2}), \ldots, \mu_S(x_{i_{k-1}}, x_{i_k})$ are all greater than zero.
Then the sequence $C = (x_{i_1}, x_{i_2}, \ldots, x_{i_k})$ will be said
to be a chain from $x_{i_1}$ to $x_{i_k}$, with the strength of the chain
defined as the strength of its weakest link, that is,

$$\text{Strength of } (x_{i_1}, x_{i_2}, \ldots, x_{i_k}) =
\min[\mu_S(x_{i_1}, x_{i_2}), \mu_S(x_{i_2}, x_{i_3}), \ldots, \mu_S(x_{i_{k-1}}, x_{i_k})].$$

From the definition of the max - min composition, it
follows that the (i, j)th element of $S^n$, n = 1, 2, 3, ...,
is the strength of the strongest chain of length n from
$x_i$ to $x_j$. Thus, the transitivity condition may be stated in


words as follows: 
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Strength of the strongest chain from x, to x;. 


It may be noted that if X has m elements, then any 


chain C of length k 3 m from Xe peo eX: 


;. must necessarily have 
& 


loops (the presence of one or more elements in X appear more 
than once in the chain C = (x; , RGus wena ease hie Duet ULNe 
loops are removed, the resulting chain C of length 4 m will 

have at least the same strength as C. Consequently, for any 


ebements x., x. in X, we can assert that: 


J 


Strength of the strongest chain from x; to x; = 


strength of the strongest chain of length 4 m from x; to x,. 


3.6. Feature Extraction Based on Fuzzy Relation. 

In 1971, S. Tamura, S. Higuchi, and K. Tanaka proposed a
method of classifying patterns using a fuzzy relation to
measure the similarity between each pair of patterns taken
from the population of patterns to be classified (19). The
similitude between any two patterns is calculated using the
max - min composition of a fuzzy relation. The similitude
is extended to n steps such that the complete similitude
between two patterns is achieved.


Let X be a set of patterns. The fuzzy relation on X
is characterized by its membership function $f(x, y)$,
$0 \le f(x, y) \le 1$, for all $x, y \in X$. The one - step fuzzy
relation $f_1(x, y)$ satisfies two conditions:

1) $f_1(x, x) = 1$ for all x in X,

2) $f_1(x, y) = f_1(y, x)$ for all x, y in X.

Condition 1 implies that x is exactly the same pattern
as x, and condition 2 implies that the fuzzy relation is
symmetric.

If $f_1(x, y)$ is known for each of the pairs of patterns
in X, then the n - step fuzzy relation $f_n(x, y)$ may be
defined by the equation:

$$f_n(x, y) = \max_{x_1, \ldots, x_{n-1} \in X} \min\left[ f_1(x, x_1),\ f_1(x_1, x_2),\ \ldots,\ f_1(x_{n-1}, y) \right].$$

Similarly, the (n+1) - step fuzzy relation is given by

$$f_{n+1}(x, y) = \max_{x_1, \ldots, x_n \in X} \min\left[ f_1(x, x_1),\ f_1(x_1, x_2),\ \ldots,\ f_1(x_n, y) \right].$$

It may be noted that restricting the maximization to
$x_n = y$ and using $f_1(y, y) = 1$ gives

$$f_{n+1}(x, y) \ge \max_{x_1, \ldots, x_{n-1} \in X} \min\left[ f_1(x, x_1),\ \ldots,\ f_1(x_{n-1}, y),\ f_1(y, y) \right] = f_n(x, y).$$

Therefore,

$$0 \le f_1(x, y) \le f_2(x, y) \le \cdots \le f_n(x, y) \le f_{n+1}(x, y) \le \cdots \le 1,$$

and the complete similitude $\hat{f}(x, y)$ is in the closed
interval between zero and one such that

$$\hat{f}(x, y) = \lim_{n \to \infty} f_n(x, y).$$

As an example, let $X = \{x_1, x_2, x_3, x_4\}$ and let
$f_1(x_i, x_j)$, i, j = 1, 2, 3, 4, be given as in Figs. 11 and 12.
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Fig. 11. 1 = step Fuzzy Relation Matrix of Elements 


Fig. 12. 1 - step Fuzzy Relation Diagram for $f_1(x_i, x_j)$'s.


The complete fuzzy relation &£(x;, x;) = KAXy x; ) is shown 
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Fig. 14. Complete Fuzzy Relation Diagram for A(x; s x; )'s. 


When the set of patterns X has a finite number of
elements, it is sometimes convenient to represent the fuzzy
relation between elements in matrix form. We represent the
fuzzy relation matrix F as

$$F = \| f_1(x_i, x_j) \|$$

where i, j = 1, 2, ..., n, $x_i$ and $x_j$ are in X, and n is
the number of elements in X.

For a fuzzy relation matrix A, we denote the (i, j)th
entry by $a_{ij}$ where $0 \le a_{ij} \le 1$. We may then define:

1) $A \le B$ if and only if $a_{ij} \le b_{ij}$ for all i, j.

2) $I = \| m_{ij} \|$, where $m_{ij} = 1$ if i = j and
$m_{ij} = 0$ otherwise.

3) $C = A \circ B$ if and only if
$c_{ij} = \max_{k} \min(a_{ik}, b_{kj})$.


A ae ay eee eee Vs 


_ ij i 


4) $C = \max(A, B)$ if and only if

$$c_{ij} = \begin{cases} a_{ij} & \text{if } a_{ij} > b_{ij}, \\ b_{ij} & \text{otherwise.} \end{cases}$$

Thus, all diagonal elements of any fuzzy relation matrix
F are equal to unity and

$$I \le F \le F^2 \le \cdots \le F^{n-1} = F^n = \cdots$$

The (i, j)th entry of $F^n$ is the value of $f_n(x_i, x_j)$.
Hence, the complete fuzzy relation $\hat{f}(x_i, x_j)$ may be calculated
rather easily and quickly by successive application of

$$F^2 = F \circ F, \quad F^4 = F^2 \circ F^2, \quad \ldots, \quad F^{2^k} = F^{2^{k-1}} \circ F^{2^{k-1}}.$$


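As an example, let

$$F = \begin{bmatrix} 1 & 0.8 & 0 & 0.1 & 0.2 \\ 0.8 & 1 & 0.4 & 0 & 0.9 \\ 0 & 0.4 & 1 & 0 & 0 \\ 0.1 & 0 & 0 & 1 & 0.5 \\ 0.2 & 0.9 & 0 & 0.5 & 1 \end{bmatrix}$$

then

$$F^2 = F \circ F = \begin{bmatrix} 1 & 0.8 & 0.4 & 0.2 & 0.8 \\ 0.8 & 1 & 0.4 & 0.5 & 0.9 \\ 0.4 & 0.4 & 1 & 0 & 0.4 \\ 0.2 & 0.5 & 0 & 1 & 0.5 \\ 0.8 & 0.9 & 0.4 & 0.5 & 1 \end{bmatrix}$$

and

$$F^4 = F^2 \circ F^2 = \begin{bmatrix} 1 & 0.8 & 0.4 & 0.5 & 0.8 \\ 0.8 & 1 & 0.4 & 0.5 & 0.9 \\ 0.4 & 0.4 & 1 & 0.4 & 0.4 \\ 0.5 & 0.5 & 0.4 & 1 & 0.5 \\ 0.8 & 0.9 & 0.4 & 0.5 & 1 \end{bmatrix}$$

Squaring once more leaves the matrix unchanged,
$F^8 = F^4 \circ F^4 = F^4$; thus $F^4$ gives the complete fuzzy
relation $\hat{f}(x_i, x_j)$.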
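
The successive squaring in the example above can be reproduced with a short Python sketch. It is offered as an illustration under stated assumptions, not as the program used in this study: the names `compose` and `complete_relation` are hypothetical, and the stopping rule (stop when squaring no longer changes the matrix) is one simple way to detect that the complete relation has been reached.

```python
def compose(a, b):
    """Max-min composition of two square fuzzy relation matrices."""
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def complete_relation(f):
    """Square F repeatedly (F, F^2, F^4, ...) until the matrix stops changing."""
    current = f
    while True:
        squared = compose(current, current)
        if squared == current:
            return current
        current = squared


# The 5 x 5 one-step relation matrix F of the example in the text.
F = [
    [1.0, 0.8, 0.0, 0.1, 0.2],
    [0.8, 1.0, 0.4, 0.0, 0.9],
    [0.0, 0.4, 1.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 1.0, 0.5],
    [0.2, 0.9, 0.0, 0.5, 1.0],
]

F2 = compose(F, F)
F4 = compose(F2, F2)
for row in F4:
    print(row)
print(complete_relation(F) == F4)  # True
```

Because min and max only select among the entries already present, no rounding occurs and the matrices can be compared exactly.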




CHAPTER IV 
KARHUNEN - LOEVE EXPANSION 


4.1. General. 

Before an input pattern can be fed into a classifier, 
there are two problems that must be solved. The first is a 
sensing problem; the designer has to decide what is to be 
measured from the input patterns and how to associate 
measured quantities with individual characteristics. Usually,
the raw data being measured is of high dimension; it contains
much redundant and insignificant information that contributes
little to the recognition process. The second problem is
concerned with feature selection and preprocessing; the
preprocessor attempts to select discriminatory characteristic
features or attributes from the input pattern space.
Feature selection is of major importance in pattern
recognition systems; it reduces the dimensionality of the
input measurement vector and in turn greatly reduces the
computation time required in the recognition process. The
present chapter is devoted to discussion of a possible 
solution to the second problem; discussion of the first 
problem is deferred until Chapter V. 

The selection of features can generally be divided 
into two phases, namely, preselection and postselection (35). 
Preselection requires no knowledge of class samples. 
Features of the input pattern that are known to be 
ineffective in discriminating between individual objects, or 


are judged to be useless from a general knowledge about the 
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nature of the desired class, will be eliminated in this phase. 
It is usually performed in conjunction with feature extraction 
which is the first stage of any automatic pattern recognition 
system. On the basis of the collections of sample patterns

at hand, postselection selects from the sample patterns those 
features which are most effective in distinguishing samples 

of one class from those of another. A number of methods 

have been proposed for postselection, and the author is 
particularly interested in a method derived from an optimal 


expansion known as the generalized Karhunen - Loéve expansion. 


4.2. The Generalized Karhunen - Loéve Expansion. 


An optimal feature selection and feature ordering 
procedure may be developed based on the generalized Karhunen - 
Loéve expansion. The procedure depends on use of an 
orthogonal transformation of coordinates in the representation 
space in order to obtain the optimal coordinate system with 
weights sharply concentrated in a few coordinates.

Application of the Karhunen - Loéve expansion to feature 
selection was first suggested by S. Watanabe in 1965 (27) 

and further developed by K. S. Fu and T. Y. Chien in 1967 (28). 
This feature selection and ordering scheme has the important 
characteristic of not requiring the computation of the 
probability of misrecognition, or complete knowledge of the 
probability distribution of the input pattern. It is 
applicable to any data processing system where a reduction 

of dimensionality or compression of information is desired 


prior to subsequent processing. The basis of the scheme is 
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a preweighting of features according to their relative 
importance in description of the input patterns. Relative 
importance is meant in the sense of carrying more information 
regarding the discrimination of pattern classes so that use 
of only a finite number of these features introduces a 


relatively small error.


4.2.1. Derivation of the Generalized Karhunen - Loéve 
Expansion. 

Consider observation of a stochastic process
$\{X(t),\ 0 \le t \le T\}$ over a period of time (0, T), the observed
random function $X(t)$, $0 \le t \le T$, being generated from m
possible stochastic processes $\{X_i(t),\ 0 \le t \le T\}$,
i = 1, 2, ..., m, corresponding to m pattern classes. Let
$P_i$ be the probability that the ith process occurs, and
suppose $\sum_{i=1}^{m} P_i = 1$. We wish to express the random function
$X_i(t)$ in the form:

$$X_i(t) = \sum_{k=1}^{\infty} V_{ik}\, \varphi_k(t) \quad \text{for all } t \in (0, T) \text{ and } i = 1, 2, \ldots, m, \qquad [4.1]$$

where the $V_{ik}$'s are random coefficients that satisfy the
relation $E(V_{ik}) = 0$ (achieved by centralizing all the random
functions). The set $\{\varphi_k(t)\}$ is a set of deterministic
orthonormal coordinate functions over the range (0, T), that
is,

$$\int_0^T \varphi_k(t)\, \varphi_\ell^*(t)\, dt = \delta_{k\ell} \qquad [4.2]$$

where the * indicates the complex conjugate and $\delta_{k\ell}$ is the
Kronecker delta function equal to one if $k = \ell$ and equal to
zero otherwise.
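We define a covariance function K(t, s) for the m
stochastic processes as follows:

$$K(t, s) = \sum_{i=1}^{m} P_i\, E[X_i(t)\, X_i^*(s)]. \qquad [4.3]$$

Substituting [4.1] into [4.3] gives

$$K(t, s) = \sum_{i=1}^{m} P_i\, E\Big[ \Big\{ \sum_{k} V_{ik}\, \varphi_k(t) \Big\} \Big\{ \sum_{\ell} V_{i\ell}^*\, \varphi_\ell^*(s) \Big\} \Big] = \sum_{k} \sum_{\ell} \varphi_k(t)\, \varphi_\ell^*(s) \sum_{i=1}^{m} P_i\, E(V_{ik} V_{i\ell}^*). \qquad [4.4]$$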


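If furthermore, the random coefficients $V_{ik}$'s are chosen to
satisfy the conditions:

$$\sum_{i=1}^{m} P_i\, E(V_{ik} V_{i\ell}^*) = \begin{cases} \sum_{i=1}^{m} P_i\, \mathrm{Var}(V_{ik}) = \sigma_k^2 & \text{if } k = \ell, \\ 0 & \text{if } k \ne \ell, \end{cases} \qquad [4.5]$$

then the covariance function K(t, s) can be expressed in the
form:

$$K(t, s) = \sum_{k} \sigma_k^2\, \varphi_k(t)\, \varphi_k^*(s). \qquad [4.6]$$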


In other words, if the expansion in [4.1] exists for
the random function $X_i(t)$, and the random coefficients $V_{ik}$
satisfy the conditions in [4.5], then the covariance function
K(t, s) can be expressed in the form shown in [4.6].


Furthermore, from [4.6] it follows that

$$\int_0^T K(t, s)\, \varphi_\ell(s)\, ds = \int_0^T \sum_{k} \sigma_k^2\, \varphi_k(t)\, \varphi_k^*(s)\, \varphi_\ell(s)\, ds.$$

Then, if the summation and integration may be interchanged,

$$\int_0^T K(t, s)\, \varphi_\ell(s)\, ds = \sum_{k} \sigma_k^2\, \varphi_k(t) \int_0^T \varphi_k^*(s)\, \varphi_\ell(s)\, ds = \sigma_\ell^2\, \varphi_\ell(t). \qquad [4.7]$$

Therefore, $\{\sigma_k^2\}$ and $\{\varphi_k(t)\}$ satisfy the integral equation
defined in [4.7]. The expansion in [4.1], whose orthogonal
coordinate functions $\{\varphi_k(t)\}$ are determined by [4.7] through
the covariance function K(t, s), is called the generalized
Karhunen - Loéve expansion.

In the terminology used to treat integral equations,
the $\{\varphi_k(t)\}$ are the characteristic functions, or eigenfunctions,
and the $\{\sigma_k^2\}$ are the characteristic values, or eigenvalues,
of the covariance function K(t, s).


4.3. Optimal Properties of the Generalized Karhunen - Loéve
Expansion.


Fu and Chien have remarked that there are two optimal 
properties for the generalized Karhunen - Loéve expansion 
(28), namely, 

1) The expansion minimizes the mean square error 

that results by selecting a finite number of terms 
in the infinite series of expansion. 

2) The expansion minimizes the entropy function 

defined over the variances of the coordinate 


coefficients in the expansion.


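4.3.1. Derivation of the First Property.

Let $\{\psi_k(t)\}$ be an arbitrary set of
coordinate functions and let [4.1] be rewritten as:

$$X_i(t) = \sum_{k=1}^{n} V_{ik}\, \psi_k(t) + R_{in}(t) \qquad [4.8]$$

where $R_{in}(t)$ is the remainder when the expansion terminates
at k = n. Define the expected value of the square of the
modulus of the remainders by $\sum_{i=1}^{m} P_i E[|R_{in}(t)|^2]$. We wish to
find the coordinate functions which give the best approximation
to the random functions $X_i(t)$, in the sense that among all
possible expansions having the same number of terms, the
particular choice of coordinate functions minimizes
$\sum_{i=1}^{m} P_i E[|R_{in}(t)|^2]$. Writing out the expansion one obtains,

$$\begin{aligned}
\sum_{i=1}^{m} P_i E[|R_{in}(t)|^2]
&= \sum_{i=1}^{m} P_i E\Big[ \Big\{ X_i(t) - \sum_{k=1}^{n} V_{ik} \psi_k(t) \Big\} \Big\{ X_i^*(t) - \sum_{\ell=1}^{n} V_{i\ell}^* \psi_\ell^*(t) \Big\} \Big] \\
&= \sum_{i=1}^{m} P_i E\Big[ X_i(t) X_i^*(t) - \sum_{k=1}^{n} V_{ik} \psi_k(t) X_i^*(t) - \sum_{\ell=1}^{n} V_{i\ell}^* \psi_\ell^*(t) X_i(t) + \sum_{k=1}^{n} \sum_{\ell=1}^{n} V_{ik} V_{i\ell}^* \psi_k(t) \psi_\ell^*(t) \Big] \\
&= \sum_{i=1}^{m} P_i E[|X_i(t)|^2] + \sum_{k=1}^{n} \sum_{\ell=1}^{n} \sum_{i=1}^{m} P_i E(V_{ik} V_{i\ell}^*)\, \psi_k(t) \psi_\ell^*(t) \\
&\quad - \sum_{k=1}^{n} \psi_k(t) \sum_{i=1}^{m} P_i E[V_{ik} X_i^*(t)] - \sum_{k=1}^{n} \psi_k^*(t) \sum_{i=1}^{m} P_i E[V_{ik}^* X_i(t)]. \qquad [4.9]
\end{aligned}$$

Substituting [4.1] and [4.5] in terms of the generalized
Karhunen - Loéve expansion, we have

$$\sum_{i=1}^{m} P_i E[V_{ik} X_i^*(t)] = \sum_{i=1}^{m} P_i E\Big[ V_{ik} \sum_{\ell} V_{i\ell}^* \varphi_\ell^*(t) \Big] = \sum_{\ell} \sum_{i=1}^{m} P_i E(V_{ik} V_{i\ell}^*)\, \varphi_\ell^*(t) = \sigma_k^2\, \varphi_k^*(t). \qquad [4.10]$$

Similarly,

$$\sum_{i=1}^{m} P_i E[V_{ik}^* X_i(t)] = \sigma_k^2\, \varphi_k(t). \qquad [4.11]$$

Substituting [4.10] and [4.11] into [4.9] gives

$$\begin{aligned}
\sum_{i=1}^{m} P_i E[|R_{in}(t)|^2]
&= \sum_{i=1}^{m} P_i E[|X_i(t)|^2] + \sum_{k=1}^{n} \sigma_k^2\, \psi_k(t) \psi_k^*(t) - \sum_{k=1}^{n} \sigma_k^2\, \psi_k(t) \varphi_k^*(t) - \sum_{k=1}^{n} \sigma_k^2\, \psi_k^*(t) \varphi_k(t) \\
&= \sum_{i=1}^{m} P_i E[|X_i(t)|^2] + \sum_{k=1}^{n} \sigma_k^2 \left[ \psi_k(t) \psi_k^*(t) - \psi_k(t) \varphi_k^*(t) - \psi_k^*(t) \varphi_k(t) \right]. \qquad [4.12]
\end{aligned}$$

It can be shown that

$$\psi_k(t) \psi_k^*(t) - \psi_k(t) \varphi_k^*(t) - \psi_k^*(t) \varphi_k(t) = |\psi_k(t) - \varphi_k(t)|^2 - |\varphi_k(t)|^2.$$

Therefore, [4.12] can be rewritten as

$$\sum_{i=1}^{m} P_i E[|R_{in}(t)|^2] = \sum_{i=1}^{m} P_i E[|X_i(t)|^2] - \sum_{k=1}^{n} \sigma_k^2 |\varphi_k(t)|^2 + \sum_{k=1}^{n} \sigma_k^2 |\psi_k(t) - \varphi_k(t)|^2.$$

Obviously, the minimum value of $\sum_{i=1}^{m} P_i E[|R_{in}(t)|^2]$ is attained
at $\psi_k(t) = \varphi_k(t)$ where $\varphi_k(t)$ is the generalized Karhunen -
Loéve coordinate function defined in [4.7]. Thus

$$\min\Big\{ \sum_{i=1}^{m} P_i E[|R_{in}(t)|^2] \Big\} = \sum_{i=1}^{m} P_i E[|X_i(t)|^2] - \sum_{k=1}^{n} \sigma_k^2 |\varphi_k(t)|^2.$$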

It may be noted that the necessary conditions which allow
the expansion of [4.1] to be the generalized Karhunen - Loéve
expansion defined in [4.7] are

$$\sum_{i=1}^{m} P_i\, E(V_{ik} V_{i\ell}^*) = \begin{cases} \sigma_k^2 & \text{if } k = \ell, \\ 0 & \text{if } k \ne \ell. \end{cases}$$


The relation implies that the random coefficients between
each pair of coordinate functions among all classes of
stochastic processes should be uncorrelated. It is noted,
however, that the random coefficients between each pair of
coordinate functions for a single class need not be
uncorrelated.


4.3.2. Derivation of the Second Property. 
Let $X_i(t)$ be square integrable and normalized such
that

$$\int_0^T |X_i(t)|^2\, dt = 1, \quad i = 1, 2, \ldots, m. \qquad [4.13]$$

Then from [4.1] we can show that $\sum_{k} |V_{ik}|^2 = 1$. The proof
is as follows:

$$|X_i(t)|^2 = X_i(t)\, X_i^*(t) = \Big[ \sum_{k} V_{ik} \varphi_k(t) \Big] \Big[ \sum_{\ell} V_{i\ell}^* \varphi_\ell^*(t) \Big] = \sum_{k} \sum_{\ell} V_{ik} V_{i\ell}^*\, \varphi_k(t) \varphi_\ell^*(t). \qquad [4.14]$$

Substituting [4.14] into [4.13], we have

$$\int_0^T |X_i(t)|^2\, dt = \sum_{k} \sum_{\ell} V_{ik} V_{i\ell}^* \int_0^T \varphi_k(t) \varphi_\ell^*(t)\, dt = \sum_{k} |V_{ik}|^2. \qquad [4.15]$$

Therefore, from [4.13] and [4.15], we may conclude that
$\sum_{k} |V_{ik}|^2 = 1$. If we define $\rho_k$ for each coordinate function
$\varphi_k(t)$, k = 1, 2, ..., as

$$\rho_k = \sigma_k^2 = \sum_{i=1}^{m} P_i\, E[|V_{ik}|^2]$$

where the $\sigma_k^2$'s are the eigenvalues of the integral equation
defined in [4.7], then

$$\sum_{k} \rho_k = \sum_{k} \sum_{i=1}^{m} P_i\, E[|V_{ik}|^2] = \sum_{i=1}^{m} P_i \sum_{k} E[|V_{ik}|^2] = \sum_{i=1}^{m} P_i = 1.$$

It may be noted that $\rho_k \ge 0$; therefore, the $\rho_k$'s form a
probability distribution on the set of generalized Karhunen -
Loéve coordinate functions $\{\varphi_k(t)\}$.


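Now define an entropy function for the $\rho_k$'s of the
$\{\varphi_k(t)\}$ as

$$H[\{\varphi_k(t)\}] = -\sum_{k} \rho_k \log \rho_k. \qquad [4.16]$$

If the $\rho_k$'s are ordered such that

$$\rho_1 \ge \rho_2 \ge \cdots \ge \rho_k \ge \rho_{k+1} \ge \cdots$$

then for any other similarly ordered $\lambda_k$'s associated with
any arbitrary set of coordinate functions $\{\psi_k(t)\}$ we have

$$\sum_{k=1}^{n} \rho_k \ge \sum_{k=1}^{n} \lambda_k, \quad n = 1, 2, \ldots \qquad [4.17]$$

Thus in terms of entropy

$$-\sum_{k} \rho_k \log \rho_k \le -\sum_{k} \lambda_k \log \lambda_k \qquad [4.18]$$

and

$$H[\{\varphi_k(t)\}] = \min_{\{\psi_k(t)\}} H[\{\psi_k(t)\}]. \qquad [4.19]$$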


4.4. Discrete Equivalent of the Generalized Karhunen - 


Loéve Expansion. (28) 


Suppose, instead of continuously observing a random
function $X_i(t)$ over a period of time, sampled measurements
are taken from the random function in the following form:

$$X_i = (x_{i1}, x_{i2}, \ldots, x_{ik}), \quad i = 1, 2, \ldots, m,$$

where each $X_i$ is a random vector of k components (k is
finite). The desired expansion then becomes

$$x_{ij} = \sum_{n} V_{in}\, u_{nj}, \quad j = 1, 2, \ldots, k, \qquad [4.20]$$

where the $V_{in}$'s are the random coefficients and $u_{nj}$ is the
jth component of the nth orthonormal coordinate vector in a
set of orthonormal coordinate vectors $\{u_n\}$ which is analogous
to the set of orthogonal coordinate functions $\{\varphi_k(t)\}$ in the
continuous case. If we define the discrete analog of the
covariance function K(t, s) for m stochastic processes as
$\sum_{i=1}^{m} P_i E(x_{it}\, x_{is}^*)$ where t, s = 1, 2, ..., k,
then

$$\begin{aligned}
K(t, s) &= \sum_{i=1}^{m} P_i\, E(x_{it}\, x_{is}^*) \\
&= \sum_{i=1}^{m} P_i\, E\Big[ \Big( \sum_{n} V_{in} u_{nt} \Big) \Big( \sum_{\ell} V_{i\ell}^* u_{\ell s}^* \Big) \Big] \\
&= \sum_{n} \sum_{\ell} u_{nt}\, u_{\ell s}^* \sum_{i=1}^{m} P_i\, E(V_{in} V_{i\ell}^*) \\
&= \sum_{n} \sigma_n^2\, u_{nt}\, u_{ns}^*. \qquad [4.21]
\end{aligned}$$

Furthermore, by the orthonormality of the coordinate vectors,
we have

$$\sum_{s=1}^{k} K(t, s)\, u_{\ell s} = \sum_{s=1}^{k} \Big( \sum_{n} \sigma_n^2\, u_{nt}\, u_{ns}^* \Big) u_{\ell s} = \sum_{n} \sigma_n^2\, u_{nt} \sum_{s=1}^{k} u_{ns}^*\, u_{\ell s} = \sigma_\ell^2\, u_{\ell t}. \qquad [4.22]$$

Therefore, the generalized Karhunen - Loéve expansion for
the discrete case becomes

$$x_{ij} = \sum_{n} V_{in}\, u_{nj}, \quad j = 1, 2, \ldots, k,$$

where $u_{nj}$ is the jth component of the nth orthonormal
coordinate vector satisfying [4.22]. The random coefficient
$V_{in}$ is determined for each n by the equation:

$$V_{in} = \sum_{j=1}^{k} x_{ij}\, u_{nj}^*. \qquad [4.23]$$

It may be noted that the coordinate vectors $u_n$'s of the
generalized Karhunen - Loéve expansion are essentially the
eigenvectors determined from the covariance matrix K(t, s).
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4.5. Practical Application of the Generalized Karhunen - 
Loéve Expansion. 


The generalized Karhunen - Loéve expansion and its
optimalities may find practical applications in designing a
suboptimal procedure for feature selection and ordering in
pattern recognition systems. In automatic classification
systems the good feature observations, which are the most
representative and informative, should be selected by a
preprocessor prior to initiation of the recognition process.
One may choose to minimize the mean square error and select
the coordinate system (the generalized Karhunen - Loéve
system) whose coordinate coefficients represent the pattern
samples of different classes in the most significant manner.
The minimized entropy property of the Karhunen - Loéve
expansion implies that the linear transformation produces
the most efficient information compression over the
coordinate system in the sense that most of the random
coefficients are concentrated in a few coordinates instead
of widely distributed among all of them.

By properly constructing the generalized Karhunen -
Loéve coordinate system, and arranging the coordinate
functions $\{\varphi_k(t)\}$ (continuous case) or the coordinate vectors
$\{u_n\}$ (discrete case) according to descending order of their
associated eigenvalues $\sigma_k^2$, the resulting feature observations
will always contain the maximum information about the input
pattern samples whenever the recognition process terminates
at a finite number of measurements.
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In practice, it is difficult to construct the desired 
coordinate system through the integral equations. However, 
one can achieve the same purpose by simply recognizing and 
applying the necessary and sufficient conditions for the 


expansion to be the generalized Karhunen - Loéve expansion (36). 


4.5.1. Necessary and Sufficient Conditions for the 


Generalized Karhunen - Loéve Expansion. 


Fu and Chien stated the necessary and sufficient
conditions for a generalized Karhunen - Loéve expansion to be

1) $\sum_{i=1}^{m} P_i\, E(V_{ik} V_{i\ell}^*) = \sigma_k^2\, \delta_{k\ell}$,
where $\delta_{k\ell}$ is the Kronecker delta function,

and 2) $\sigma_k^2 = \sum_{i=1}^{m} P_i\, \mathrm{Var}(V_{ik})$.

The proof for sufficiency has been given in sections
4.2.1., 4.3.1. and 4.3.2. during derivation of the optimal
properties of the generalized Karhunen - Loéve expansion.

The necessity may be proven as follows:

Assume that the covariance function K(t, s) is defined
as

$$K(t, s) = \sum_{i=1}^{m} P_i\, E[X_i(t)\, X_i^*(s)]$$

where $\{\varphi_k(t)\}$ are determined by

$$\int_0^T K(t, s)\, \varphi_k(s)\, ds = \sigma_k^2\, \varphi_k(t).$$

It may be noted that

$$V_{ik} = \int_0^T X_i(t)\, \varphi_k^*(t)\, dt \qquad [4.24]$$

which is obtained by application of integration to [4.1]
over the range (0, T). Similarly,

$$V_{i\ell}^* = \int_0^T X_i^*(s)\, \varphi_\ell(s)\, ds. \qquad [4.25]$$

Substituting [4.24] and [4.25] into condition 1, it follows
that

$$\sum_{i=1}^{m} P_i\, E(V_{ik} V_{i\ell}^*) = \int_0^T \varphi_k^*(t) \Big[ \int_0^T K(t, s)\, \varphi_\ell(s)\, ds \Big] dt = \sigma_\ell^2 \int_0^T \varphi_k^*(t)\, \varphi_\ell(t)\, dt = \sigma_k^2\, \delta_{k\ell}.$$
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Therefore, construction of the desired coordinate system is 
equivalent to finding the coordinate function (or vector) in 
which the coordinate coefficients are mutually uncorrelated
so that equations in [4.5] are satisfied. The procedure is 
basically that of de-correlating the coordinate coefficients 
over the ensemble of all pattern samples from different 
classes. In many recognition processes where the covariance 
functions are real and symmetric, this de-correlation process 
simply amounts to diagonalization of the corresponding 


covariance function. 


4.6. Procedure for Formulation of the Karhunen - Loéve 


System. (20) 


Consider a pattern recognition system consisting of a
preprocessor and a classifier. The preprocessor is designed
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to select and order the features by choosing an optimal 
coordinate system. Let the set of features that describe a
pattern sample be denoted by the vector,

$$X_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$$

where i is the index for a pattern class, i = 1, 2, ..., m
(m = the total number of classes), and k is the total number
of features observed for each pattern sample. We define
feature selection and ordering as the process of deciding
upon the proper sequence of feature observations for a
particular classifier. The proposed optimal coordinate
system (the generalized Karhunen - Loéve system) can be
determined by application of the following steps which are
also summarized in Fig. 15.

1) Obtain the covariance function K(t, s) from the 
given sample measurement vectors. If the components 
of the sample vectors assume real values, the 
covariance function K(t, s) is a real symmetric 
matrix. 

2) Find the eigenvalues, and the associated 
eigenvectors, of K(t, s). Let the eigenvectors be 
normalized and lexicographically arranged 
according to descending order of their associated 
eigenvalues. The set of orthonormal vectors thus 
obtained constitutes the generalized Karhunen - 
Loéve coordinate system. 

3) Make the linear transformation as defined in [4.23]
using the orthonormal eigenvectors obtained from step 2.
The resulting $V_{in}$'s are the desired coordinate
coefficients in terms of the generalized Karhunen -
Loéve coordinate system.

It may be noted that the set of $V_{in}$'s is the set of
new (or transformed) features to be observed by the classifier.

Fig. 15. Flowchart of the above procedure: generate the
observation vector $X_i = (x_{i1}, x_{i2}, \ldots, x_{ik})$, form the
covariance matrix, and solve for the eigenvalues and
eigenvectors of the covariance matrix.
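
The three steps of the procedure can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the program used in this study: the function name `karhunen_loeve_system`, the argument layout (one array of sample vectors per class), and the choice to centralize the samples about the prior-weighted mean of the class means are all assumptions made for the example. Real-valued features are assumed, so the complex conjugates drop out.

```python
import numpy as np


def karhunen_loeve_system(class_samples, priors):
    """Sketch of steps 1) to 3) for real-valued feature vectors.

    class_samples: list of arrays, one per class, each of shape (n_i, k).
    priors: list of class probabilities P_i that sum to one.
    Returns (eigenvalues, eigenvectors, coefficients); the eigenvectors are
    the columns of the second result, ordered by descending eigenvalue.
    """
    # Centralize the samples (assumption: subtract the prior-weighted mean).
    mean = sum(p * x.mean(axis=0) for p, x in zip(priors, class_samples))
    centered = [x - mean for x in class_samples]

    # Step 1: K(t, s) = sum_i P_i E[x_it x_is], estimated by sample averages.
    k_matrix = sum(p * (x.T @ x) / len(x) for p, x in zip(priors, centered))

    # Step 2: eigenvalues and orthonormal eigenvectors, in descending order.
    eigenvalues, eigenvectors = np.linalg.eigh(k_matrix)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Step 3: V_in = sum_j x_ij u_nj for every sample of every class.
    coefficients = [x @ eigenvectors for x in centered]
    return eigenvalues, eigenvectors, coefficients


# Hypothetical data: 3 classes, 50 samples each, 4 features per sample.
rng = np.random.default_rng(0)
samples = [rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4)) + i for i in range(3)]
priors = [0.5, 0.3, 0.2]
values, vectors, coeffs = karhunen_loeve_system(samples, priors)

# The prior-weighted second moments of the new features are diagonal,
# with the eigenvalues on the diagonal (condition 1 of section 4.5.1).
mixed = sum(p * (v.T @ v) / len(v) for p, v in zip(priors, coeffs))
print(np.allclose(mixed, np.diag(values)))
```

Keeping only the first few columns of the coefficient arrays then gives the reduced feature set, since the columns are already ordered by descending eigenvalue.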
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CHAPTER V 
DATA BASE 


5.1. The CACM Data Base. 

The data base used to test the proposed automatic 
document classification system is composed of keywords 
selected from 179 titles of papers published in CACM, the 
Communications of the Association for Computing Machinery, 
during the years 1968, 1969, 1970, and 1971 (Volumes 11 - 14).
All these papers were pre-classified when published; they 
were classified according to the CACM classification schedule. 
These pre-classified documents can be used as the input test 
data for the proposed classification system; they provide
standard answers to measure the efficiency of the system. 
Significant keywords, which emphasize the dissimilarities 
between papers of different classes, are selected from the 
data base. For the individual classes certain sample 
statistics, such as the sample mean vector and sample 
covariance matrix, are extracted and form the required a 
priori information for the Bayes’ classifier. 

The CACM data base is punched on cards. An individual 
datum in the data base consists of keywords selected from the 
title of a document together with a pre-assigned class number 
of the document. The pre-assigned class number constitutes 
the last logical record of each physical record. For each
card of 80 columns, the first two columns are reserved for
a right justified integer that serves as a header to indicate
the number of subsequent logical record fields that contain
information about the document. The remaining 78 columns of
the card are divided into 13 logical record fields each of
six columns. Left justified truncated keywords of five
characters are stored in these logical record fields, and
the sixth column of each field is always left blank. Each
keyword of fewer than five characters is padded with an
appropriate number of blanks on its right hand side. The
pre-assigned class number is a digit between 1 and 5, and
the number is placed in the second column of the last
logical record.

The instance in which the header equals 13 requires
special consideration. If the thirteenth logical record
field is occupied by a digit in the second column and by
blanks elsewhere, then the document information occupies 13
logical record fields, and the digit in the second column
of the thirteenth logical record field is the pre-assigned
class number. On the other hand, if the thirteenth logical
record field is occupied by a string of alphanumeric
characters with the first character starting from the first
column of this field, then the document information occupies
more than 13 logical record fields and further information
is recorded on the next card.

A listing of the keywords selected from the title of
each individual paper in the CACM data base can be found in
Appendix 1. An example of a single document record is
the following:

 5ANALY TIME  SHARI TECHN  1

It is a document whose pre-assigned class number is 1.
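
The card layout described above can be decoded with a short Python sketch. It is an illustration only, not the program used in this study. The names `split_card`, `is_class_field`, and `read_documents` are hypothetical, and two points are assumptions: a continuation card is taken to have the same layout (two-column header followed by six-column fields), and a document is taken to end at the first field that holds a blank, a digit from 1 to 5, and blanks elsewhere.

```python
FIELD_WIDTH = 6  # five keyword characters plus one blank column


def split_card(card):
    """Return the logical record fields announced by the two-column header."""
    card = card.ljust(80)
    count = int(card[:2])  # right justified header, e.g. " 5" or "13"
    return [
        card[2 + FIELD_WIDTH * i: 2 + FIELD_WIDTH * (i + 1)]
        for i in range(count)
    ]


def is_class_field(field):
    """True if the field holds only a class digit in its second column."""
    return field[0] == " " and field[1] in "12345" and field[2:].strip() == ""


def read_documents(cards):
    """Yield (keywords, class_number) for each document in a deck of cards."""
    keywords = []
    for card in cards:
        for field in split_card(card):
            if is_class_field(field):
                yield keywords, int(field[1])
                keywords = []
            else:
                keywords.append(field[:5].rstrip())


# The single-card example from the text, built field by field.
card = " 5" + "ANALY " + "TIME  " + "SHARI " + "TECHN " + " 1"
print(list(read_documents([card])))
# [(['ANALY', 'TIME', 'SHARI', 'TECHN'], 1)]
```

Because keywords are left justified, a keyword field never starts with a blank, which is what lets the class field be recognized without counting fields across cards.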


5.2. Selection of Keywords.

Three types of features may be measured from a pattern, 
namely, physical features, topological features, and statistical 
features. Physical and topological features are commonly 
found in the recognition process used by human beings. Such 
features are easily detected by human eyes, by touch, and by
other sensory organs. Since the computer lacks human sensory
organs, the physical and topological features are not the
most efficient features in automatic recognition processes.
However, the computer may be designed to extract mathematical
or statistical features from sample patterns which humans may
have difficulty in determining manually. When the patterns
in each of the m pattern classes are random variables
governed by m distinct probability density functions, the
computer may be taught to perform classifications based on
sample statistics.

Computers have memory and are capable of recognizing, 
comparing, and identifying words. In automatic document 
classification, the document is often represented by a set 
of selected keywords. The computer performs recognition 
based on identification of certain keywords rather than on 
understanding of their semantic meaning. Words related to 
the original keywords may be added in order to amplify the 
information in the stored document representation. Given 


a set of documents of a data base, the assignment of keywords 
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for each individual document constitutes the sensing problem. 

It is seldom practical to assign keywords by reading 
manually over the entire document. Not only would such a 
method involve too much manual work, but the keywords so 
assigned would necessarily be based on subjective considerations. 
They would only reflect a particular indexer's interpretation 
of the contents of the paper. Much manual work may be saved 
by selecting keywords from an abstract of the paper; however, 
abstracts are not always present in scientific papers. Also 
a relatively large amount of processing is still required. 

Statistical examinations show that the authors of 
scientific papers tend to choose the titles of their 
publications with care; the title of a scientific paper often 
gives a good indication of the contents of the paper (37). 
These titles often provide a satisfactory source of 
appropriate keywords for scientific papers. 

Elimination of keywords that are useless in the 
recognition process is a legitimate strategy in information 
compression. It has therefore been used in preparation of 
the data base for the present investigation. In selecting 
keywords from the title, any insignificant words such as the 
articles, the prepositions, and the conjunctions, which 
contribute no discriminatory information in the recognition 
process, have been ignored. To provide a standard for 
consistent keyword expression, all the selected keywords 
were expressed in singular form. Each keyword selected was 


right-truncated to five characters. This procedure helps to 
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eliminate the parts of speech problem since words arising 
from the same stem tend to be coded in the same form. As an
example, the words computer, computers, computed,
computing, computation, computational and computability are
all from the same stem "compute". When truncated to five
characters, all these words are coded and stored in the data
base as "COMPU". Truncating keywords also has the advantage
of saving storage, which may well constitute the major cost
in manipulation of a large data base.
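
The keyword coding just described can be sketched in Python. This is an illustration and not the procedure actually used to prepare the data base: the names `encode_keyword` and `title_to_keywords`, the short stop-word list, and the sample title are all hypothetical, and the conversion to singular form is not implemented because truncation to five characters already merges most singular and plural forms.

```python
import re

# Hypothetical stop list of articles, prepositions, and conjunctions.
STOP_WORDS = {
    "a", "an", "the", "of", "in", "on", "for", "to", "with", "by",
    "and", "or", "but",
}


def encode_keyword(word, width=5):
    """Right-truncate a word to `width` characters, in upper case."""
    return word.upper()[:width]


def title_to_keywords(title):
    """Return the coded keywords of a title, each appearing only once."""
    keywords = []
    for word in re.findall(r"[A-Za-z]+", title):
        if word.lower() in STOP_WORDS:
            continue
        code = encode_keyword(word)
        if code not in keywords:
            keywords.append(code)
    return keywords


variants = ["computer", "computers", "computed", "computing",
            "computation", "computational", "computability"]
print({encode_keyword(w) for w in variants})  # {'COMPU'}

# Hypothetical title, chosen only for illustration.
print(title_to_keywords("An Analysis of Time Sharing Techniques"))
# ['ANALY', 'TIME', 'SHARI', 'TECHN']
```

Words shorter than five characters are kept whole; padding with blanks is only needed when the keyword is written into a six-column card field.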


5.3. Selection of Classes. 

The Communications of the Association for Computing 
Machinery uses 13 classes to classify the submitted papers 
in Computer Science. These 13 classes are as follows: 

1) Algorithms,

2) Computer Systems, 

3) Education, 

4) Graphic and Image Processing, 

5) Information Retrieval, 

6) Management / Data Base Systems, 

7) Management Science / Operations Research, 

8) Numerical Mathematics, 

9) Operating Systems, 

10) Programming Languages, 
11) Programming Techniques, 
12) Scientific Applications, 
13) Standards. 


Owing to the nature of the subjects involved in class [...] 
and class 13, the documents in these two classes are 
difficult for an automatic document classification system 
to recognize. There was only one paper in the 
field of "Graphic and Image Processing" published in the 
CACM between the years of 1968 and 1971; therefore, there is 
not enough data to form sample statistics for the documents 
of class 4. For economic reasons, the author used five 
of the remaining ten classes to test the efficiency 
of the proposed automatic document classification system. 
The five chosen classes are as follows: 

1) Computer Systems, 

2) Information Retrieval, 

3) Operating Systems, 

4) Programming Languages, 


5) Programming Techniques. 


5.4. Characteristics of the CACM Data Base. 

The CACM data base used in the present study is composed 
of keywords selected from 179 titles of papers published in 
the Communications of the Association for Computing Machinery 
during the years 1968, 1969, 1970, and 1971. These keywords 
are selected from the titles of documents in the five selected 
classes. The distributions of the sample documents in these 
classes are summarized in Table I. 

Description                 No. of Documents    Percentage 

Computer Systems                                   18.44 
Information Retrieval 
Operating Systems 
Programming Languages                              24.58 
Programming Techniques                             34.08 


Table I. Distributions of the Sample Documents in the 5 Classes. 


The above statistics show that an average of 36 sample 
documents are used to represent each class, and the sample 
documents in class 5 constitute the largest bulk of the data 
base (34.08%). 

A total of 909 selected keywords are used to describe 
the contents of the 179 sample documents, and this gives an 
average of 5.1 selected keywords per document. 

There are 381 distinct keywords in the CACM data base, 
therefore each selected keyword occurs on the average of 
2.6 times in the entire collection of the sample documents. 
A listing of the 381 distinct keywords in the CACM data base 
is shown in Appendix 2. 

Fourteen is the maximum number of keywords used to 
represent a document in the data base; identical keywords 
in the same title appear only once. A minimum of one 
keyword is used to represent a document. 


CHAPTER VI 
THE PROPOSED CLASSIFICATION SYSTEM 


6.1. Introduction. 

The creation of the proposed automatic document 
classification system can be divided into six phases, 
namely, 

1) Feature Preselection Phase, 

2) Association Measures Assignment phase, 

3) Complete Fuzzy Relations Assignment Phase, 

4) Feature Selection and Feature Ordering Phase, 

5) Sample Statistics Estimation Phase, 

6) Classification Phase. 

The first three phases are designed to determine the 
strongest possible fuzzy relations between the distinct 
keywords in the data base; while the feature selection and 
feature ordering phase serves to provide a compressed data 
base for the classifier. The sample statistics estimation 
phase necessarily precedes the classification because the 
Bayes' classifier is chosen for use in the classification 
process. This classifier requires the sample statistics, 
such as the mean feature vectors and the covariance matrices, 
to serve as the statistical data for classification. Thus 
the first five phases are designed essentially to prepare a 
suitable data base for the Bayes' classifier so that an 
efficient classification system may be obtained. 

It may be noted that much computation time is required 
in the first five phases of the proposed system because they 
necessarily have to involve all the distinct keywords in the 
data base. However, this large amount of computation time 
is worthwhile in the sense that these phases prepare a 
compressed data base for the Bayes' classifier; the process 
of classification is much simplified by the fact that only 
a few significant features will be observed and they form 
the bases for classification. Since the first five phases 
are required only once, in the long run, the computation 
times that are saved in the classification process will more 
than compensate for the initial large investment of time. 

A detailed description of the creation of the proposed 
system will be presented in the subsequent sections. 


6.2. Feature Preselection Phase. 

Since the title of each document is used to describe 
its content, the titles of all the sample documents in the 
data base were first examined manually by the system 
designer in order to choose the important words 
from the titles of the documents. The selected keywords 
were, in fact, most title words other than the prepositions, 
articles, and conjunctions. The designer also had to use 
his own intuition to exclude those title words which obviously 
would not add to an understanding of the document's content. 
For instance, given a document having 
the title "Improving Round-off in Runge-Kutta Computation 
with Gill's Method", the designer should describe the document 
with the following eight keywords: 
IMPRO ROUND OFF RUNGE KUTTA COMPU GILL' METHO. 
It may be noted that hyphenated title words were treated as 
two separate words; the selected keywords were all expressed 
in singular form and were right-truncated to five characters. 
The selected keywords of each document together with 
its pre-assigned class number were punched on card(s), and 
this string of information was regarded as a physical record 
of the data base. The collection of cards from all the sample 
documents constituted the data base for the proposed system. 
The procedure used for the feature preselection phase 
is shown in Fig. 16. 

Fig. 16. The Flow Diagram for the Feature Preselection Phase. 
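
The mechanical part of the preselection rules (splitting hyphenated words, dropping articles, prepositions and conjunctions, and right-truncating to five characters) can be sketched as follows. This is only an illustration: the stop list `STOP_WORDS` and the function name `title_to_keywords` are assumptions, and neither the conversion to singular form nor the designer's manual judgement is modelled.

```python
import re

# Hypothetical stop list. The text only states that articles, prepositions
# and conjunctions were ignored; it does not give the list actually used.
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "for", "to", "with",
              "by", "and", "or", "but"}


def title_to_keywords(title, width=5):
    """Return the right-truncated, upper-case keywords of a title."""
    keywords = []
    # Hyphenated title words are treated as two separate words.
    for word in re.split(r"[\s\-]+", title.strip()):
        if not word or word.lower() in STOP_WORDS:
            continue
        code = word.upper()[:width]
        # Identical keywords in the same title are recorded only once.
        if code not in keywords:
            keywords.append(code)
    return keywords


example = title_to_keywords(
    "Improving Round-off in Runge-Kutta Computation with Gill's Method"
)
print(" ".join(example))
```

Under these assumptions the example title yields the same eight keywords as listed above, and words such as "computer", "computing" and "computational" all collapse to the code COMPU.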


6.3. Association Measures Assignment Phase. 


For each selected keyword of the sample document, it 
is necessary to assign a vector of values to indicate the 
associations between the selected keyword and the other 
distinct keywords in the data base. One way to obtain these 
values is by finding the statistical association measures 
between the distinct keywords in the data base. 

The strings of information for individual sample 
documents in the data base were input to the computer. The 
selected keywords and their corresponding occurrences in 
terms of document numbers were recorded in logical records. 
Thus each logical record contains a keyword and its 
corresponding occurrence. The keywords in the logical records 
were sorted by the IBM sort and merge package in order to be 
ranked in ascending order according to the ASCII code 
(the ASCII code places the special characters in front, with 
the alphabetical characters in the middle, and the integers 
at the end). 


6.3.1. Document Term Matrix. 


The sorted logical records were input to the computer. 
The distinct keywords were extracted from the sorted keyword 
list. At the same time a document term matrix, with the 
distinct keywords along one axis and the document numbers 
along the other, was formed. This matrix indicated the 
keywords that would be found in a particular document. Those 
distinct keywords of the data base that did not appear in the 
title of a particular document were tagged with the logical 
constant "false", and the distinct keywords that were 
present in the document were tagged with the logical constant 
"true". This arrangement allowed some saving in storage 
because a logical*1 matrix could be used to store the 
required information. 
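
A minimal sketch of how such a logical document term matrix might be built is given below. The three sample documents and the function name `build_document_term_matrix` are hypothetical; Python's built-in `sorted` stands in for the IBM sort and merge package.

```python
def build_document_term_matrix(documents):
    """Return (distinct keywords, logical matrix) for a list of keyword lists."""
    # Distinct keywords in sorted order form the columns of the matrix.
    distinct = sorted({kw for doc in documents for kw in doc})
    column = {kw: j for j, kw in enumerate(distinct)}
    # One row per document; every entry starts as "false".
    matrix = [[False] * len(distinct) for _ in documents]
    for i, doc in enumerate(documents):
        for kw in doc:
            matrix[i][column[kw]] = True
    return distinct, matrix


# Hypothetical three-document data base (keywords already truncated).
documents = [
    ["COMPU", "SYSTE"],
    ["COMPU", "LANGU"],
    ["SYSTE", "OPERA", "COMPU"],
]
distinct, dtm = build_document_term_matrix(documents)
for row in dtm:
    print(row)
```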


6.3.2. Term Connection Matrix. 

The number of times of co-occurrence of a particular 
pair of distinct keywords in the sample documents of the data 
base can be measured by linking the keywords through their 
corresponding occurrences. A term connection matrix was 
formed by multiplying the document term matrix with its 
transpose so that each element of the resulting term 
connection matrix indicated the number of documents that were 
relevant to that pair of keywords. The term connection 
matrix also gave some indication of the relations between the 
distinct keywords in the data base. Keywords that often 


appear together in the same documents may be regarded as 


having a close relationship. 
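
The co-occurrence counting can be illustrated with a small sketch. The document term matrix below is a hypothetical example with documents as rows and keywords as columns, so the term connection matrix corresponds to the product of its transpose with itself; the loop counts the same quantity directly.

```python
def term_connection_matrix(dtm):
    """Count, for every pair of keywords, the documents containing both."""
    n_terms = len(dtm[0])
    tcm = [[0] * n_terms for _ in range(n_terms)]
    for row in dtm:
        present = [j for j, flag in enumerate(row) if flag]
        for a in present:
            for b in present:
                tcm[a][b] += 1
    return tcm


# Hypothetical logical matrix: 3 documents by 4 distinct keywords.
dtm = [
    [True, False, False, True],
    [True, True, False, False],
    [True, False, True, True],
]
tcm = term_connection_matrix(dtm)
for row in tcm:
    print(row)
```

The diagonal element `tcm[i][i]` is the number of documents containing keyword i, and the matrix is symmetric.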


6.3.3. Term Relation Matrix. 

The term relation matrix may be regarded as the 
normalized version of a term connection matrix such that 
each element of the term relation matrix satisfies the fuzzy 
set property. In other words, the elements of the term 
relation matrix are all governed by a membership function 
and all their values lie in the closed interval between zero 
and one. A statistical association measure formula proposed 
by L. B. Doyle (12) was used to perform the required 
conversion. The association measure formula may be expressed 
in the following form: 

$$\mathrm{TRM}(i, j) = \frac{\mathrm{TCM}(i, j)}{\mathrm{TCM}(i, i) + \mathrm{TCM}(j, j) - \mathrm{TCM}(i, j)}$$

where TRM(i, j) is the ith row and the jth column element of 
the term relation matrix which indicates the similarity 
relation between the ith and jth keywords, 
TCM(i, j) is the ith row and the jth column element of 
the term connection matrix which indicates the number 
of documents relevant to both the ith and jth keywords, 
TCM(i, i) is the ith diagonal element of the term 
connection matrix which indicates the number of 
documents relevant to the ith keyword, 
and TCM(j, j) is the jth diagonal element of the term 
connection matrix which indicates the number of 
documents relevant to the jth keyword. 
The expression TCM(i, i) + TCM(j, j) - TCM(i, j) in the 
denominator represents the number of documents that contain 
at least one of the ith and jth keywords. It has a 
normalizing effect and ensures that the value of every 
element in the term relation matrix will lie in the closed 
interval between zero and one, and thus satisfy the fuzzy 
set property. 
It may be noted that the term relation matrix is 
equivalent to a 1 - step fuzzy relation matrix because the 
calculated values of TRM(i, j) indicate the direct 
relations between the ith and jth keywords, and all these 
values lie in the closed interval between zero and one. 
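
As a hedged illustration, the normalization can be written as a short routine. The term connection matrix used here is a hypothetical four-keyword example, and the guard against a zero denominator is an added assumption for keywords that occur in no document.

```python
def term_relation_matrix(tcm):
    """Normalize co-occurrence counts into values in the interval [0, 1]."""
    n = len(tcm)
    trm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Documents containing keyword i, keyword j, or both.
            denominator = tcm[i][i] + tcm[j][j] - tcm[i][j]
            trm[i][j] = tcm[i][j] / denominator if denominator else 0.0
    return trm


# Hypothetical term connection matrix for four keywords.
tcm = [[3, 1, 1, 2], [1, 1, 0, 0], [1, 0, 1, 1], [2, 0, 1, 2]]
trm = term_relation_matrix(tcm)
for row in trm:
    print([round(value, 3) for value in row])
```

Every diagonal element comes out as 1, and keywords that never co-occur receive the value 0, which matches the form of the 1 - step fuzzy relation matrix.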


A summary of the association measures assignment phase is 
shown in Fig. 17. 

Fig. 17. The Flow Diagram for the Association Measures 
Assignment Phase. 


6.4. Complete Fuzzy Relations Assignment Phase. 


Many distinct keywords in the data base did not have 
direct relations with each other; therefore a considerable 
number of zeroes were present in the 1 - step fuzzy relation 
matrix. In order to extract the maximum possible relations 
between keywords in the data base, the indirect relations 
between keywords should also be used. An indirect relation 
between a pair of keywords may be defined as the relation 
obtained when the keywords are linked together by a chain 
of n keywords that connects them. In the instance when one 
keyword is used as the connection between a pair of keywords, 
the relation measure so obtained is known as the 2 - step 
relation. The 2 - step fuzzy relation matrix may be obtained 
from the 1 - step fuzzy relation matrix by applying the 
following max - min composition operation in fuzzy logic (39): 

$$\mu_{A \circ B}(x, y) = \max_{v} \min[\mu_A(x, v), \mu_B(v, y)]$$

Consider, for example, a 1 - step fuzzy relation matrix 
with four distinct keywords whose direct relations are shown 
in Fig. 18. 

        K(1)   K(2)   K(3)   K(4) 
K(1)    1.0    0.4    0.1    0.3 
K(2)    0.4    1.0    0      0.2 
K(3)    0.1    0      1.0    0.7 
K(4)    0.3    0.2    0.7    1.0 


Fig. 18. The 1 - step Fuzzy Relation Matrix. 


From the 1 - step fuzzy relation matrix, it may be 
noted that there is no direct relation between keyword 2 and 
keyword 3. However, direct relations do exist between 
keywords 1 and 2 and keywords 1 and 3, with the resulting 
values equal to 0.4 and 0.1 respectively. Likewise, there 
are relations between keywords 4 and 2 and keywords 4 and 3 
with the values equal to 0.2 and 0.7 respectively. By 
applying the max - min composition operation on the 1 - step 
fuzzy relation matrix, one should be able to obtain the 
indirect 2 - step fuzzy relation between keywords 2 and 3 
through keyword 4, with a value equal to 0.2. 
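
The worked example can be checked with a short sketch of the max - min composition. The matrix entries are those of the four-keyword example as read from Fig. 18 and the surrounding text; the function name `max_min_compose` is an assumption.

```python
def max_min_compose(a, b):
    """Max - min composition of two square fuzzy relation matrices."""
    n = len(a)
    return [
        [max(min(a[x][v], b[v][y]) for v in range(n)) for y in range(n)]
        for x in range(n)
    ]


# 1 - step fuzzy relation matrix of the four-keyword example.
R1 = [
    [1.0, 0.4, 0.1, 0.3],
    [0.4, 1.0, 0.0, 0.2],
    [0.1, 0.0, 1.0, 0.7],
    [0.3, 0.2, 0.7, 1.0],
]
R2 = max_min_compose(R1, R1)

# Indices are zero-based, so keywords 2 and 3 are R2[1][2].
print(R2[1][2])
```

The 2 - step relation between keywords 2 and 3 evaluates to 0.2, obtained through keyword 4 as min(0.2, 0.7).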

Based on the fact that $A^{2n} = A^{n} \circ A^{n}$, we can obtain 
the 2n - step from the n - step fuzzy relation matrix 
rather quickly and efficiently. Certainly, in order to 
obtain the complete fuzzy relations between the distinct 
keywords in the data base, we have to apply the max - min 
composition operation repetitively until the elements in the 
2n - step fuzzy relation matrix are exactly the same as those 
in the n - step case. However, for economic reasons, the 
process may be discontinued as soon as most of the elements 
remain relatively stable, and the last 2n - step fuzzy 
relation matrix may be used as the complete fuzzy relation 
matrix for the distinct keywords in the data base. 

In our experiment, the elements remained relatively 
stable at the 32 - step; therefore, the 32 - step fuzzy 
relation matrix was taken as the complete fuzzy relation 
matrix. 
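
The doubling procedure can be sketched as a loop that squares the relation matrix under max - min composition until nothing changes. This is an illustration on the four-keyword example only; the names and the `max_doublings` cut-off are assumptions, and the exact-equality test is the strict stopping rule rather than the "relatively stable" criterion used in the experiment.

```python
def max_min_compose(a, b):
    """Max - min composition of two square fuzzy relation matrices."""
    n = len(a)
    return [
        [max(min(a[x][v], b[v][y]) for v in range(n)) for y in range(n)]
        for x in range(n)
    ]


def complete_fuzzy_relation(relation, max_doublings=10):
    """Square the relation repeatedly; return (stable matrix, step count)."""
    steps = 1
    current = relation
    for _ in range(max_doublings):
        doubled = max_min_compose(current, current)
        if doubled == current:
            # The 2n - step matrix equals the n - step matrix: stop.
            return current, steps
        current = doubled
        steps *= 2
    return current, steps


R1 = [
    [1.0, 0.4, 0.1, 0.3],
    [0.4, 1.0, 0.0, 0.2],
    [0.1, 0.0, 1.0, 0.7],
    [0.3, 0.2, 0.7, 1.0],
]
closure, steps = complete_fuzzy_relation(R1)
print(steps)
for row in closure:
    print(row)
```

For this small example the matrix stops changing at the 4 - step relation; a data base with 381 keywords naturally needs more doublings.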

A flow diagram showing the procedure for finding the 
complete fuzzy relation between keywords in the data base is 
shown in Fig. 19. 

Fig. 19. The Flow Diagram for the Complete Fuzzy 
Relations Assignment Phase. 


6.5. Feature Selection and Feature Ordering Phase. 


Of the four feature selection methods described in 
Chapter II, the Karhunen - Loéve expansion method seems to 
be most appropriate for the proposed automatic document 
classification system. The application of the Karhunen - 
Loéve expansion in feature selection requires determination 
of the distinct keywords in the data base. 


6.5.1. Feature Vector. 

The sample documents in the data base were input to 
the computer. The selected keywords and their corresponding 
occurrences in terms of document numbers were recorded in 
logical records. These records were sorted by the IBM sort 
and merge package so that the keywords in the logical 
records were ranked according to the ASCII code. Each 
sorted keyword was compared with the keywords in the distinct 
keyword list. If there was a match between the sorted 
keyword and a keyword in the distinct keyword list, the 
corresponding document number of the sorted keyword was 
recorded. The relation vector between the matched keyword 
and the other distinct keywords in the data base was 
extracted from the complete fuzzy relation matrix; this 
relation vector forms part of the constituents of the feature 
vector for that particular document. 

Consider, as an example, a document described by three 
keywords. Three relation vectors would be extracted from 
the complete fuzzy relation matrix. By summing these three 
vectors, element by element, the feature vector for that 


particular document is obtained. 
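
A minimal sketch of this summation follows. The keyword names K1 to K4 and the relation values are hypothetical stand-ins for the 381 distinct keywords and the complete fuzzy relation matrix.

```python
def feature_vector(doc_keywords, distinct, relation):
    """Sum, element by element, the relation vectors of a document's keywords."""
    index = {kw: i for i, kw in enumerate(distinct)}
    vector = [0.0] * len(distinct)
    for kw in doc_keywords:
        row = relation[index[kw]]  # relation vector of this keyword
        vector = [total + value for total, value in zip(vector, row)]
    return vector


# Hypothetical complete fuzzy relation matrix for four distinct keywords.
distinct = ["K1", "K2", "K3", "K4"]
relation = [
    [1.0, 0.4, 0.3, 0.3],
    [0.4, 1.0, 0.3, 0.3],
    [0.3, 0.3, 1.0, 0.7],
    [0.3, 0.3, 0.7, 1.0],
]
# A document described by three keywords.
vector = feature_vector(["K2", "K3", "K4"], distinct, relation)
print([round(value, 2) for value in vector])
```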


6.5.2. Mean Feature Vector. 

In our experiment there were 381 distinct keywords and 
179 documents in the data base. As a consequence, there were 
179 feature vectors each having 381 elements. Using these 
feature vectors to form the rows of a matrix, the resultant 
matrix was the feature matrix for the sample documents in the 
data base with 179 x 381 elements. The mean feature vector 
was obtained by summing the elements in the matrix 
columnwise and dividing the resulting vector by 179 (since 
there were 179 documents in the data base). 


6.5.3. The Covariance Matrix. 

The covariance matrix can be obtained by applying the 
statistical formula in the form: 

$$\Sigma = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T$$

where $\Sigma$ is the covariance matrix of the data base, 

n is the number of documents in the data base, 

$X_i$ is the feature vector for the ith document in the 
data base, 

and $\bar{X}$ is the mean feature vector of all the documents in 
the data base. 
In our experiment, the covariance matrix was a 

381 x 381 matrix which described the correlations between 


the distinct keywords in the data base. 
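
The two statistics can be sketched in a few lines. The normalization by n (rather than n - 1) is an assumption made to match the way the mean feature vector is computed, and the three two-element feature vectors are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def covariance_matrix(rows):
    """Covariance matrix of the feature vectors (assumed 1/n normalization)."""
    n = len(rows)
    mean = mean_vector(rows)
    size = len(mean)
    cov = [[0.0] * size for _ in range(size)]
    for row in rows:
        deviation = [x - m for x, m in zip(row, mean)]
        for r in range(size):
            for s in range(size):
                cov[r][s] += deviation[r] * deviation[s] / n
    return cov


# Hypothetical feature matrix: 3 documents by 2 features.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
mean = mean_vector(features)
cov = covariance_matrix(features)
print(mean)
print(cov)
```

With 179 documents and 381 keywords the same routine would produce a 381 x 381 symmetric matrix.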


6.5.4. The Transformation Matrix. 

The next step is to find the transformation matrix 
for the Karhunen - Loéve expansion. The transformation 
matrix is formed by the selected eigenvectors of the 
covariance matrix. Since the dimension of the covariance 
matrix in our experiment was rather large, it was convenient 
to use the IMSL subroutine packages to find the required 
eigenvectors. The original covariance matrix was transformed 
into a tridiagonal matrix by the Householder reduction method. In 
other words, the original matrix was transformed into a matrix 
having the information concentrated on the elements along the 
main diagonal and the two subdiagonals of the matrix. It may 
be noted that the details about this transformation must be 
saved so that the eigenvectors of the original matrix may be 
restored in the subsequent stage. The Q. L. algorithm, which 
evolved from the Q. R. algorithm, was used to find the 
eigenvalues and the eigenvectors of the tridiagonal matrix. To 
obtain the eigenvectors of the original covariance matrix, 
we had to make use of the transformation details recorded 
previously. 


6.5.5. Feature Ordering. 

The eigenvectors and their corresponding eigenvalues 
were recorded in logical records. The IBM sort and merge 
package was used to arrange the eigenvalues in descending 
order. The eigenvectors corresponding to the top twenty 
eigenvalues were used to form the required transformation 
matrix. In our experiment, the transformation matrix was a 


20 x 381 matrix with its 20 rows formed by the 20 selected 


eigenvectors. 
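
The ordering step can be illustrated as a sort of the eigenvalue and eigenvector pairs. The three pairs below are hypothetical values, not results from the experiment, and `sorted` again stands in for the IBM sort and merge package.

```python
def select_transformation_rows(eigenvalues, eigenvectors, k):
    """Return the eigenvectors belonging to the k largest eigenvalues."""
    order = sorted(
        range(len(eigenvalues)),
        key=lambda i: eigenvalues[i],
        reverse=True,
    )
    return [eigenvectors[i] for i in order[:k]]


# Hypothetical eigenpairs of a 3 x 3 covariance matrix.
eigenvalues = [0.5, 2.0, 1.2]
eigenvectors = [
    [0.0, 0.0, 1.0],  # belongs to eigenvalue 0.5
    [1.0, 0.0, 0.0],  # belongs to eigenvalue 2.0
    [0.0, 1.0, 0.0],  # belongs to eigenvalue 1.2
]
transformation = select_transformation_rows(eigenvalues, eigenvectors, 2)
print(transformation)
```

With k = 20 and 381-element eigenvectors, the result has the 20 x 381 shape described above.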


6.5.6. The Compressed Data Base. 


According to the theory of the Karhunen - Loéve 
expansion, the compressed data base can be obtained by 
multiplication of the original matrix with the transformation 
matrix. The compressed relation matrix in our experiment was 
a 20 x 381 matrix with the 20 most significant keywords 
along one axis and the 381 distinct keywords in the data 
base along the other. This matrix revealed the complete 
relations between the 20 significant keywords and the 381 
distinct keywords in the data base. A summary of the feature 
selection and feature ordering phase is shown in Fig. 20. 

Fig. 20. The Flow Diagram for the Feature Selection and 
Feature Ordering Phase. 
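
The compression itself is an ordinary matrix product, sketched below on toy sizes. The 2 x 3 transformation matrix and the 3 x 3 relation matrix are hypothetical (the rows are not real eigenvectors), and it is assumed that the "original matrix" being multiplied is the complete fuzzy relation matrix.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical transformation matrix (2 selected features by 3 keywords).
transformation = [
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
]
# Hypothetical complete fuzzy relation matrix (3 by 3 keywords).
relation = [
    [1.0, 0.2, 0.4],
    [0.2, 1.0, 0.6],
    [0.4, 0.6, 1.0],
]
compressed = matmul(transformation, relation)
print(len(compressed), len(compressed[0]))
```

The product has one row per selected feature and one column per distinct keyword, which is the 20 x 381 layout reported for the experiment.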


6.6. Sample Statistics Estimation Phase. 

The sample documents in the data base were sorted into 
5 classes according to the pre-assigned class numbers. 
Documents from a particular class were input to the computer, 
and the selected keywords of each document were compared 
with the distinct keywords in the data base. If a match 
occurred, the relation vector between the selected keyword 
and the 20 significant keywords was extracted from the 
compressed relation matrix. The same procedure was repeated 
for all other selected keywords of the document. Summation 
of these relation vectors gave the feature vector for the 
document with respect to the 20 significant keywords. The 
mean feature vector for all the documents in that class was 
obtained by summing the feature vectors and dividing the 
resultant vector by the number of documents in that class. 
The covariance matrix for a particular class of documents 
was obtained by the following statistical formula: 

$$\Sigma_i = \frac{1}{n_i} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)(X_{ij} - \bar{X}_i)^T, \quad i = 1, 2, \ldots, 5$$

where $\Sigma_i$ is the covariance matrix for class i documents, 

$n_i$ is the number of documents in class i, 

$X_{ij}$ is the feature vector for the jth document in 
class i, 

and $\bar{X}_i$ is the mean feature vector for class i documents. 

In our experiment, the covariance matrix for each 
class was a 20 x 20 real symmetric matrix. The same sample 
statistics finding routine was used to obtain the mean 
feature vector and covariance matrix for other classes. 
These sample statistics for the five selected classes form the 
required a priori information for the Bayes' classifier. 


6.7. Classification Phase. 

According to the theory of the parametric training 
method stated in Chapter II, the Bayes' classifier using the 
maximum likelihood discriminant rule for five classes may be 
expressed in terms of five discriminant functions $G_i(X)$, 
i = 1, 2, ..., 5, each of the form: 

$$G_i(X) = -\frac{1}{2} (X - \bar{X}_i)^T \Sigma_i^{-1} (X - \bar{X}_i) - \frac{1}{2} \log_e |\Sigma_i| + b_i$$

where $b_i = \log_e P(W_i)$, 

$|\Sigma_i|$ is the determinant of the covariance matrix for 
class i documents, 

$\Sigma_i^{-1}$ is the inverse of the covariance matrix for class i 
documents, 

X is the feature vector of the document to be 
classified, 

and $\bar{X}_i$ is the mean feature vector for class i documents. 

In the expression for $b_i$, $P(W_i)$ is the a priori 
probability of occurrence of class i documents; it may be 
calculated by the formula: 

$$P(W_i) = \frac{Z_i}{\sum_{j=1}^{5} Z_j}$$

where $Z_i$ is the number of occurrences of class i documents 
in the data base. 


The five discriminant functions produced five scalar 
values, each indicating the likelihood that the document should 
be classified in that particular class. The class number which 
corresponded to the maximum scalar value of the $G_i(X)$'s was 
assigned to the document as indicating the class to which it 
should belong. 
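
A hedged sketch of such a discriminant rule is given below. To stay short it handles only two features, so the 2 x 2 determinant and inverse are written out by hand; the experiment used 20 features and five classes. The class means, covariance matrices and prior probabilities are hypothetical, and the function also returns the second-best class, as used by the two ranks scheme of Chapter VII.

```python
import math


def discriminant(x, mean, cov, prior):
    """Gaussian discriminant value for one class (two-feature case only)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inverse = [[d / det, -b / det], [-c / det, a / det]]
    diff = [x[0] - mean[0], x[1] - mean[1]]
    quadratic = sum(
        diff[r] * inverse[r][s] * diff[s] for r in range(2) for s in range(2)
    )
    return math.log(prior) - 0.5 * math.log(det) - 0.5 * quadratic


def classify_two_ranks(x, classes):
    """Return the class labels with the largest and second largest values."""
    scored = sorted(
        ((discriminant(x, *stats), label) for label, stats in classes.items()),
        reverse=True,
    )
    return scored[0][1], scored[1][1]


# Hypothetical sample statistics: label -> (mean, covariance, prior).
identity = [[1.0, 0.0], [0.0, 1.0]]
classes = {
    1: ([0.0, 0.0], identity, 0.5),
    2: ([3.0, 3.0], identity, 0.3),
    3: ([0.0, 4.0], identity, 0.2),
}
first, second = classify_two_ranks([0.5, 0.2], classes)
print(first, second)
```

For the sample point the first rank is class 1 and the second rank is class 2.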

A listing of the classification program can be found 


in Appendix 3. 


CHAPTER VII 
RESULTS AND STATISTICS 


7.1. The Test Data. 

The test data was made up of 125 documents that were 
randomly selected from the Communications of the Association 
for Computing Machinery between the years of 1968 and 1972. 
These documents had each been pre-classified by the CACM into 
one of the five selected classes, and they thus provided the 
necessary standard answers for the tests. The test data was 
divided into five groups, each consisting of 25 documents. 
The constituents of the first four groups were those documents 
published in the CACM between the years of 1968 and 1971; 
these documents were also part of the sample documents that 
made up the statistical data of the proposed classification 
system. In order to test for the performance of the system 
on documents other than those in the data base, the fifth 
group included in the test data was formed from documents 
published in the year 1972. The proposed document 
classification system was applied to documents of each group 
in turn and assigned each document to an appropriate class. 
It may be noted that a "two ranks classification system" 
was used in the classifications; each document was first 
assigned to its most appropriate class (the first rank 
classification) and then assigned to its next appropriate 
class (the second rank classification). This system has the 
advantage of increasing the percentage of recall, and the 
second rank classification usually proves to be very useful 
in real life classification problems. 


7.2. Programming Details. 

Initially, all the papers published in the Communications 
of the Association for Computing Machinery between the years 
of 1965 and 1971 (7 years) were used as the sample documents 
in the data base. This accounts for a total of 307 documents 
with 4055 keywords and 1008 distinct keywords. The keywords 
were selected from both the titles and the abstracts of these 
papers. However, when constructing the term relation matrix, 
it was found that the program required at least four times 
one million bytes for storage (the term relation matrix is a 
1008 x 1008 real valued matrix). Since the computer centre 
at the University of Alberta offers a maximum of one million 
bytes of core memory, there was not enough memory space to 
manipulate such a large matrix in core. For economic reasons, 
the author was forced to trim the data base and include 
only those papers from the five selected classes published 
between the years of 1968 and 1971. 

It may be noted that minimizing the storage space is 
an important factor that merits special attention in large 
data base handling. Since there are 381 distinct keywords 
in the experimental data base, the programs for the first 
four phases as described in Chapter VI have to handle a 
square matrix of 381 x 381 elements at all times. It is 
obvious that a 381 x 381 logical*1 matrix may be used to 
store the elements of the document term matrix; this saves 
three quarters of the usual storage requirement. The term 
connection matrix is an integer valued matrix, and the value 
of each element can be stored in two bytes in an integer*2 
matrix. As for the term relation matrix and the n - step 
term relation matrix, a full word is necessary to store the 
value of each element. By making use of the symmetric 
property of these matrices, only the upper triangle of the 
matrix has to be calculated; the lower triangle is merely 
the mirror image of the upper half. 

In the feature selection and ordering phase, the 
CSO02A subroutine in the University of Alberta Computing 
Science Program Library (39) was used to find the eigenvalues 
and eigenvectors of the covariance matrix. The routine 
employs full storage mode. Householder's method of 
tridiagonalizing the input symmetric matrix is used with a 
variant of the Q. R. algorithm to find the eigenvalues. The 
eigenvectors are found using inverse iteration. However, 
this subroutine program is not suitable for a huge matrix. 
In one particular run, the program used 4 minutes CPU time 
with an elapsed time of 6 hours and 1,000,000 drum reads; 
the program was stopped by the computer operator. It is 
obvious that a paging problem exists in this subroutine, 
and using full storage mode to store a huge symmetric matrix 
may well be the main cause of this paging problem. The 
author was advised to use the International Mathematical 
and Statistical Library (IMSL) subroutine programs to tackle 
the problem. Instead of full storage mode, the symmetric 
storage mode was used to store the symmetric matrix; only the 
lower triangle of the symmetric matrix was input into the 
computer memory and these elements were stored in a vector 
form. 
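
The idea of symmetric storage can be sketched as packing the lower triangle into a single vector. The row-by-row element order and the helper names are assumptions made for illustration; the text does not state the exact layout expected by the IMSL routines.

```python
def pack_lower_triangle(matrix):
    """Store the lower triangle of a symmetric matrix row by row in a vector."""
    return [matrix[i][j] for i in range(len(matrix)) for j in range(i + 1)]


def packed_get(packed, i, j):
    """Read element (i, j) of the symmetric matrix from its packed form."""
    if j > i:
        i, j = j, i  # symmetry: use the mirrored lower-triangle element
    return packed[i * (i + 1) // 2 + j]


# Hypothetical 3 x 3 symmetric matrix.
full = [
    [1.0, 0.4, 0.1],
    [0.4, 1.0, 0.0],
    [0.1, 0.0, 1.0],
]
packed = pack_lower_triangle(full)
print(packed)

# Storage needed for a 381 x 381 symmetric matrix in each mode.
n = 381
print(n * n, n * (n + 1) // 2)
```

For n = 381 the packed form holds 72,771 elements instead of 145,161, roughly half the storage of full mode.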

The three IMSL subroutines used were EHOUSS, EQRT2S 
and EHOBKS (40). The EHOUSS routine computes a Householder's 
reduction of the input real symmetric matrix to a symmetric 
tridiagonal matrix. It is a modification of the Martin, 
Reinsch, Wilkinson Algol Procedure TRED3. The EQRT2S routine 
is designed to find all eigenvalues and eigenvectors of a 
symmetric tridiagonal matrix. It performs a Q. L. algorithm 
which is derived from the Q. R. algorithm. The routine is a 
modification of the Bowdler, Martin, Reinsch and Wilkinson 
Algol Procedure TQL2. The EHOBKS routine performs a back 
transformation to derive eigenvectors of the original matrix. 
It makes use of the details of the transformation in the 
original matrix which were computed in EHOUSS. The routine 
is a modification of Martin, Reinsch, Wilkinson Algol 
Procedure TRBAK3. Although much computer time was still 
required in the computation, the IMSL routines proved to 
have solved the paging problem. 

The sample statistics estimation phase and classification 
phase involved only ordinary programming techniques and there 
are no details worth mentioning. 


7.3. Experimental Results. 


The proposed automatic document classification system 
gave a fairly high degree of accuracy in document 
classification. Of the 25 documents in each group, the two 
ranks classification scheme correctly assigned all 25 
documents in group 1, 23 out of 25 documents in group 2, 
21 out of 25 documents in group 3, 22 out of 25 documents in 
group 4, and 20 out of 25 documents in group 5; these results 
gave the classification accuracy of 100%, 92%, 84%, 88%, and 
80% respectively for the five groups. These results are 
based on regarding a classification as correct if the correct 
class appears in either the first or the second rank. The 
author claims that the percentage of correct classifications 
in the first or second rank for the proposed system is in 
the order of 80 ± 5%. The results of the tests are summarized 
in Table II. 

Statistics showed that the system took approximately 
23 seconds of CPU time to classify 25 documents (the figure 
included the compile time) and this gave an average of 
0.92 second to classify each document. It may be of interest 
to note that an average of $3.10 was necessary to run a 
classification program that classified 25 documents, and this 
gave an average cost of 12 cents to classify each document. 
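
The figures quoted in this section can be reproduced with a few lines of arithmetic. The counts, the 23 seconds and the $3.10 are taken from the text; the variable names are only illustrative.

```python
# Correct two-rank classifications per test group (25 documents each).
correct = {1: 25, 2: 23, 3: 21, 4: 22, 5: 20}
group_size = 25

accuracy = {group: 100.0 * count / group_size for group, count in correct.items()}
seconds_per_document = 23 / group_size  # 23 CPU seconds per run of 25
cents_per_document = 310 / group_size   # $3.10 per run of 25

print(accuracy)
print(seconds_per_document, cents_per_document)
```

The cost works out to 12.4 cents per document, which the text rounds to 12 cents.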


A complete printout of the experimental classification 
results can be found in Appendix 4. 


7.4. Discussions and Suggestions. 

The high degree of accuracy of the proposed system 
suggests that automatic classification based on the concept 
of fuzzy sets is a feasible alternative to manual 
classification schemes. The proposed system is also superior 
in terms of time and expense. From the results, it is not 
surprising to find that the first four groups gave a better 
performance than the fifth group; this may be 
explained by the fact that the documents in the first four 
groups were also those documents used to calculate the 
statistical data of the system. However, the sudden drop 
in the number of correct classifications listed in the first 
rank in group 5 (an average of 21 correct classifications 
listed in the first rank for the first four groups; 15 
correct classifications listed in the first rank for group 5) 
suggests that the statistical data of the system are not 
stable enough; more sample documents should be added to the 
original data base in order to obtain more accurate statistics. 
It is believed that the performance of the proposed 
system may also be improved by some other means. Increasing 
the number of keywords to represent each document by including 
keywords extracted from the abstract may well form one such 
means. However, such an increase will certainly increase 
the storage space required in the data base preparation, and 
at the same time will require more manual work in keyword 
selection at the initial stage. A more fruitful improvement 
can be achieved by using more significant features in the 
classification phase; the Karhunen - Loéve expansion scheme 
in feature selection and ordering guarantees a better 
performance of the system with each increment in the number 
of selected features. 


CHAPTER VIII 
CONCLUSIONS 


This thesis has demonstrated that the concept of fuzzy 
relations and use of the Karhunen - Loéve expansion allows 
formulation of a feasible statistical approach to the 
decision making problem that occurs in automatic document 
classification. The concept of the proposed automatic 
document classification system is based entirely on the 
statistical relationships between keywords and subject 
categories. Keywords of a document are extended by using 
keyword association. Assuming that the distribution of the 
feature vector selected is multivariate normal for each 
pattern class, the mean feature vector and covariance matrix 
for each class are computed from the training samples. 

These statistical data are used as the bases for Bayes' 
classification in the experiment. 

Despite the fact that only 20 out of the possible 381 
distinct keywords were used as the selected features in the 
classification, the system performed adequately in the 
experiments giving 8045% accuracy in document classification. 
The relatively small number of features considered in
classification not only saves a great deal of storage space in
preparing the compressed data base; it also saves a
considerable amount of computation time that would otherwise
have been required.
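
The saving can be made concrete with a small sketch. It assumes that the compressed data base holds one row per selected feature and one column per distinct keyword (20 by 381 in the experiments), and that the feature vector of a document is the sum of the columns belonging to its keywords, which is how the classification program in the listing accumulates it. The vocabulary and the numerical entries below are invented for illustration.

```python
def feature_vector(doc_keywords, vocabulary, compressed):
    """Sum the columns of `compressed` that belong to the document's keywords.

    compressed[k][j] is the contribution of distinct keyword j to feature k.
    Keywords that are absent from the vocabulary are ignored.
    """
    column_of = {word: j for j, word in enumerate(vocabulary)}
    vec = [0.0] * len(compressed)
    for word in doc_keywords:
        j = column_of.get(word)
        if j is None:
            continue
        for k in range(len(compressed)):
            vec[k] += compressed[k][j]
    return vec


# Hypothetical data base with 4 distinct keywords and 2 selected features.
vocabulary = ["ALGOL", "FUZZY", "PARSER", "SET"]
compressed = [
    [0.5, 0.1, 0.4, 0.2],
    [0.0, 0.9, 0.1, 0.7],
]

if __name__ == "__main__":
    print(feature_vector(["FUZZY", "SET", "LOGIC"], vocabulary, compressed))
```

Each document is then represented by as many numbers as there are selected features rather than by one entry per distinct keyword, and every discriminant function works on vectors and matrices of that reduced size.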

Besides giving fast, economical and dependable
service, the system also offers other desirable features.
It has the advantage of being highly flexible; a higher
percentage of classification accuracy is guaranteed when using
more selected features in the classification.

No doubt, every system is bound to have its own defects;
the proposed system is no exception. A large investment in
terms of CPU time has to be made in the preparation of the
data base, especially during the Complete Fuzzy Relations
Assignment Phase. In fact, this phase accounts for more than
four-fifths of the total CPU time used in this project.
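
The cost of this phase can be appreciated from a sketch of the max - min composition. The sketch assumes that the complete relation matrix is obtained by composing the keyword relation matrix with itself until no entry changes; whether the program of this thesis proceeds in exactly this way is an assumption, and the 3 by 3 matrix is a toy example.

```python
def max_min_compose(r, s):
    """Max-min composition: result[i][j] = max over k of min(r[i][k], s[k][j])."""
    n = len(r)
    return [[max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_min_closure(r):
    """Repeat R <- max(R, R o R) entry by entry until the matrix stops changing."""
    current = [row[:] for row in r]
    while True:
        composed = max_min_compose(current, current)
        updated = [[max(a, b) for a, b in zip(row_a, row_b)]
                   for row_a, row_b in zip(current, composed)]
        if updated == current:
            return current
        current = updated


# Toy relation between three keywords: the first and third are related
# only through the second.
relation = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.6],
    [0.0, 0.6, 1.0],
]

if __name__ == "__main__":
    for row in max_min_closure(relation):
        print(row)
```

One composition of an n by n matrix with itself needs n * n * n min/max comparisons, that is about 5.5 * 10**7 for the 381 distinct keywords of the experiments, and the composition may have to be repeated several times.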

It is desirable to have a general classification
system which can classify documents from all disciplines.
However, because of the computer memory limitation and the
nature of the proposed scheme, the system had to limit its
classification ability to five specific fields in Computer
Science. One may question the feasibility of the proposed
system when applied in a large information centre because
such an application would involve too many keywords and would
require computation of a large number of discriminant
functions.

Updating constitutes the most serious problem in the
proposed system. When there are new developments in the
fields of interest, a number of important new keywords are
likely to appear, and these should be added to the original
data base. This requires a serious modification of the
entire data base, and the most tedious portion of the
calculations has to be repeated.

Of course, strictly speaking, the proposed automatic
document classification system is not totally automatic
since the Feature Preselection Phase is performed manually.
It has been suggested that any manual work is undesirable
in document classification since it may introduce biased
interpretation and hinder the accuracy of the system.

Further developments of the proposed system may include
using more sample documents in preparation of the data base
so as to enlarge the vocabulary of the system. We expect
that an increment in the number of sample documents for the
individual classes should give a more stable set of statistical
data for each class and hence lead to more accurate
classification.

Since the fuzzy relation matrix of the distinct keywords
in the data base is symmetric, it is possible to save
approximately half the storage space by storing the matrix
in symmetric storage mode. This modification not only
saves storage but also eliminates the paging problem which is
very common in large data base handling.
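
A minimal sketch of symmetric storage mode follows, assuming that only the lower triangle of the matrix, including the diagonal, is kept in a one-dimensional array. The class name and its methods are illustrative and do not correspond to any routine used in the thesis.

```python
class SymmetricMatrix:
    """Store an n x n symmetric matrix as its lower triangle, row by row."""

    def __init__(self, n, fill=0.0):
        self.n = n
        self.data = [fill] * (n * (n + 1) // 2)

    def _offset(self, i, j):
        # Entry (i, j) and entry (j, i) share one storage location.
        if i < j:
            i, j = j, i
        return i * (i + 1) // 2 + j

    def get(self, i, j):
        return self.data[self._offset(i, j)]

    def set(self, i, j, value):
        self.data[self._offset(i, j)] = value


if __name__ == "__main__":
    m = SymmetricMatrix(381)
    m.set(3, 10, 0.7)
    print(m.get(10, 3), len(m.data), 381 * 381)
```

For the 381 distinct keywords of the experiments, the packed form holds 72,771 entries instead of 145,161, which is slightly more than half of the full matrix.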

To eliminate all manual work in the system, it is suggested
that a "stop list" should be used to choose the important
keywords in a title. Undesirable keywords are recorded in
the stop list, and only those words other than those listed
in the stop list would be used to describe the document.
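
The suggestion can be sketched as follows. The contents of the stop list and the sample title are hypothetical; a practical stop list would be much longer and would have to be compiled for the fields concerned.

```python
import re

# Hypothetical stop list; the words are chosen for illustration only.
STOP_LIST = {"A", "AN", "AND", "FOR", "IN", "OF", "ON", "THE", "TO", "WITH"}


def title_keywords(title, stop_list=STOP_LIST):
    """Return the words of a title that are not in the stop list.

    Words are converted to upper case, and each keyword is kept only once,
    in order of first appearance.
    """
    keywords = []
    for word in re.findall(r"[A-Za-z]+", title.upper()):
        if word not in stop_list and word not in keywords:
            keywords.append(word)
    return keywords


if __name__ == "__main__":
    print(title_keywords(
        "An Application of Fuzzy Relations to the Classification of Documents"))
```

The keywords obtained in this way would replace the manually preselected keywords as input to the feature extraction phase.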

In conclusion, the author believes that automatic 
computer classification of documents by the method described 
in the present thesis is a feasible alternative to manual 
classification because the results have proved that the 


performance of the former is as good as, if not better than,
the classical manual approach. 

In order to make the computer classification more 
economical, and hence to allow application to larger sets 
of data, consideration should be given to development of 
approximate or iterative techniques for manipulation of the 
large matrices. Investigation of such techniques was 
believed to be beyond the scope of the present thesis since 
it was felt necessary to first study the unmodified use of 
fuzzy relations and the Karhunen - Loéve expansion. It is 
hoped, however, that the present study may be continued with
an emphasis on methods of reducing processing time.
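
As one illustration of what an iterative technique might look like, the sketch below applies the standard power method to estimate the dominant eigenvalue and eigenvector of a symmetric matrix, which is the kind of quantity needed by the Karhunen - Loéve expansion. The power method is not the procedure used in this thesis; it is shown only as an example under the assumption that a few leading eigenvectors suffice. The 2 by 2 matrix is a toy example.

```python
import math


def power_iteration(matrix, iterations=200, tol=1e-12):
    """Estimate the dominant eigenvalue and eigenvector of a symmetric matrix.

    The start vector must not be orthogonal to the dominant eigenvector;
    a constant vector is used here for simplicity.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        w = [x / norm for x in w]
        # Rayleigh quotient of the normalized vector.
        estimate = sum(w[i] * sum(matrix[i][j] * w[j] for j in range(n))
                       for i in range(n))
        converged = abs(estimate - eigenvalue) < tol
        v, eigenvalue = w, estimate
        if converged:
            break
    return eigenvalue, v


if __name__ == "__main__":
    value, vector = power_iteration([[4.0, 1.0], [1.0, 3.0]])
    print(value, vector)
```

Each step costs one matrix-vector product, and the full set of eigenvectors is never formed, which is the main attraction of such methods when the matrices become large.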
$$RUN *FORTG
C
C     ************************************
C     *                                  *
C     *  AUTOMATIC DOCUMENT CLASSIFIER   *
C     *                                  *
C     *             PHASE 1              *
C     *                                  *
C     ************************************
C
C
C     DECLARATION STATEMENTS
C
C
      INTEGER*2 KWORD(3,13),LIST(3,200),DOC(200)
      INTEGER*2 BLANK/'  '/,NO13/13/,ZERO/0/,SIX/6/,
     /KOUNT/0/,INDEX/0/
   10 K=1
C 
C     READ A CARD FROM THE CARD FILE
C
   20 READ(5,101,END=50)N,((KWORD(I,J),I=1,3),J=K,N)
C
C     TEST FOR THE SPECIAL CASE
C
      IF(N.NE.NO13)GOTO 30
C
C     TEST FOR NUMERIC CLASS NUMBER IN THE LAST LOGICAL
C     RECORD
C
      IF(KWORD(1,N).GT.ZERO.AND.KWORD(1,N).LT.SIX.AND.
     /KWORD(2,N).EQ.BLANK)GOTO 30
C
C     MORE INFORMATION ON THE NEXT CARD
C
      K=N+1
      GOTO 20
   30 KOUNT=KOUNT+1
C
C     RECORD THE CORRECT CLASSIFICATION FOR EACH DOCUMENT ON
C     DISK
C
      WRITE(3,104)KWORD(1,N)
      NN=N-1
C
C     STORE UP THE KEYWORDS FOR EACH DOCUMENT ON DISK
C
      WRITE(4)NN,((KWORD(I,J),I=1,3),J=1,NN)
C
C     RECORD THE CORRESPONDING DOCUMENT NUMBER FOR EACH
C     KEYWORD
C
      DO 40 I=1,NN
      INDEX=INDEX+1
      DOC(INDEX)=KOUNT
C
C     RECORD THE KEYWORDS FOR EACH DOCUMENT
C
      DO 40 J=1,3
   40 LIST(J,INDEX)=KWORD(J,I)
      GOTO 10
C
C     RECORD ALL THE KEYWORDS AND THEIR CORRESPONDING
C     DOCUMENT NUMBERS ON DISK
C
   50 WRITE(2,102)((LIST(J,I),J=1,3),DOC(I),I=1,INDEX)
C
C     RECORD THE COUNTS FOR THE NUMBER OF KEYWORDS AND THE
C     NUMBER OF DOCUMENTS ON DISK
C
      WRITE(8,103)INDEX,KOUNT
      STOP


C
C     FORMAT STATEMENTS
C
  101 FORMAT(I2,13(3A2))
  102 FORMAT(3A2,I2)
  103 FORMAT(I3,I2)
  104 FORMAT(A2)
      END
$$ENDFILE
$$CREATE -IN TYPE=SEQ
$$CREATE -CLASS TYPE=SEQ
$$CREATE -DOCU TYPE=SEQ
$$CREATE -NUMBER TYPE=SEQ
$$RUN -LOAD# 2=-IN 3=-CLASS 4=-DOCU 8=-NUMBER

DOCUMENTS TO BE CLASSIFIED

$$ENDFILE
$$CREATE -OUT TYPE=SEQ
$$RUN *SORT
SORT=CH,A,1,6,CH,A,7,2
INPUT=-IN,U,20,20
OUTPUT=-OUT,U,20,20
MNR=200
C
C     DECLARATION STATEMENTS
C
      INTEGER*2 CLASS(25),DKWORD(3,381),LIST(3,200),DOC(200),
     /DOCU(3,13),IDENT(2,25)
      REAL*4 FMAT(20,25),CDB(20,381),INV(20,20),MEAN(20),
     /DUMMY(20),B(5),FUNCT(5)
      DATA NUM/20/,JSTEP/1/,MAG/381/,NO/25/
      SMALL=-0.5*10.0**70


C
C     INITIALIZE FMAT TO ALL ZEROS
C
      DO 150 I=1,NO
      DO 150 J=1,NUM
  150 FMAT(J,I)=0.0
C
C     READ THE COMPRESSED DATA BASE FROM TAPE
C
      REWIND 1
      DO 10 I=1,NUM
   10 READ(1)(CDB(I,J),J=1,MAG)
C
C     READ THE DISTINCT KEYWORDS FROM DISK
C
      DO 20 I=1,MAG
   20 READ(2,101)(DKWORD(J,I),J=1,3)
C
C     READ THE COUNTS FOR THE NUMBER OF KEYWORDS AND THE
C     NUMBER OF DOCUMENTS FROM DISK
C
      READ(7,102)INDEX,KOUNT
C
C     READ THE CORRECT CLASSIFICATION FOR EACH DOCUMENT FROM
C     DISK
C
      DO 19 J=1,KOUNT
   19 READ(3,103)CLASS(J)


C
C     READ A KEYWORD AND ITS CORRESPONDING DOCUMENT NUMBER
C     FROM DISK
C
      DO 30 I=1,INDEX
      READ(4,104)(LIST(K,I),K=1,3),DOC(I)
   70 DO 40 N=1,3
C
C     COMPARE THE KEYWORD WITH A KEYWORD IN THE DISTINCT
C     KEYWORD LIST
C
      IF(DKWORD(N,JSTEP).LT.LIST(N,I))GOTO 50
C
C     TEST WHETHER THAT KEYWORD IS ABSENT IN THE DISTINCT
C     KEYWORD LIST
C
      IF(DKWORD(N,JSTEP).GT.LIST(N,I))GOTO 30
   40 CONTINUE
C
C     KEYWORD IS PRESENT IN THE DISTINCT KEYWORD LIST. THE
C     CONTRIBUTION OF THAT KEYWORD IS ADDED TO THE FEATURE
C     VECTOR
C
      MS=DOC(I)
      DO 60 LINK=1,NUM
   60 FMAT(LINK,MS)=FMAT(LINK,MS)+CDB(LINK,JSTEP)
      GOTO 30
C
C     COMPARE THAT KEYWORD WITH THE NEXT KEYWORD IN THE
C     DISTINCT KEYWORD LIST
C
   50 JSTEP=JSTEP+1
      GOTO 70
   30 CONTINUE


C
C     READ THE CONSTANTS FOR THE FIVE CLASSES
C
      READ(5,95)(B(IX),IX=1,5)
C
C     READ THE STATISTICS FOR CLASSIFICATION FROM TAPE
C
      DO 90 I=1,KOUNT
      REWIND 8
      DO 80 J=1,5
C
C     READ THE MEAN FEATURE VECTOR FOR A PARTICULAR CLASS
C
      READ(8)(MEAN(K),K=1,NUM)
C
C     READ THE INVERSE OF THE COVARIANCE MATRIX FOR A
C     PARTICULAR CLASS
C
      DO 207 K=1,NUM
  207 READ(8)(INV(N,K),N=1,NUM)
C
C     PERFORM THE NECESSARY CALCULATIONS FOR THE
C     DISCRIMINANT FUNCTION
C
      DO 209 L=1,NUM
  209 MEAN(L)=FMAT(L,I)-MEAN(L)
      DO 210 M=1,NUM
      DUMMY(M)=0.0
      DO 210 N=1,NUM
  210 DUMMY(M)=DUMMY(M)+MEAN(N)*INV(N,M)
      RESULT=0.0
      DO 211 L=1,NUM
  211 RESULT=RESULT+MEAN(L)*DUMMY(L)
C
C     OBTAIN THE NUMERIC VALUE FOR THE DISCRIMINANT FUNCTION
C     OF A PARTICULAR CLASS
C
   80 FUNCT(J)=B(J)-0.5*RESULT


C
C     ASSIGN A CLASS NUMBER TO THE DOCUMENT BASED ON THE
C     VALUES OF ITS DISCRIMINANT FUNCTIONS
C
      DO 320 N=1,2
      BIG=FUNCT(1)
      ID=1
C
C     FIND THE LARGEST VALUE OF THE DISCRIMINANT FUNCTIONS
C
      DO 315 K=2,5
      IF(FUNCT(K).LT.BIG)GOTO 315
      BIG=FUNCT(K)
      ID=K
  315 CONTINUE
C
C     FIND THE SECOND LARGEST VALUE OF THE DISCRIMINANT
C     FUNCTIONS
C
      IDENT(N,I)=ID
  320 FUNCT(ID)=SMALL
   90 CONTINUE


C
C     PRINT THE HEADINGS FOR THE OUTPUT
C
      WRITE(6,105)
      WRITE(6,106)
C
C     PRINT OUT THE CLASSIFICATION FOR EACH DOCUMENT
C
      DO 700 KK=1,KOUNT
C
C     RETRIEVE THE KEYWORDS FOR EACH DOCUMENT FROM DISK
C
      READ(9)NN,((DOCU(I,J),I=1,3),J=1,NN)
      WRITE(6,107)((DOCU(I,J),I=1,3),J=1,NN)
C
C     THE OUTPUT LIST AFTER CLASS(KK) IS CUT OFF IN THE
C     LISTING. THE TWO RANKED CLASSES IN IDENT ARE ASSUMED.
C
      WRITE(6,108)CLASS(KK),(IDENT(N,KK),N=1,2)
  700 CONTINUE
      STOP
C
C     FORMAT STATEMENTS
C
  101 FORMAT(3A2)
  102 FORMAT(I3,I2)
  103 FORMAT(A2)
  104 FORMAT(3A2,I2)
  105 FORMAT('1',//19X,'KEYWORDS FROM EACH DOCUMENT',48X,
     /'CORRECT CLASS',6X,'ASSIGNED CLASS'/)
  106 FORMAT(' ',110X,'1ST RANK',4X,'2ND RANK'/)
   95 FORMAT(5E14.4)
  107 FORMAT(' ',2X,13(3A2))
C
C     THE BODY OF FORMAT 108 IS NOT LEGIBLE IN THE LISTING.
C     THE LAYOUT BELOW IS AN ASSUMPTION THAT OVERPRINTS THE
C     CLASSES UNDER THE HEADINGS OF FORMATS 105 AND 106.
C
  108 FORMAT('+',99X,A2,13X,I2,10X,I2)
      END
$$ENDFILE
$$RUN *MOUNT
0007 ON 9TP *TAPE3* VOL=FC0007 RING=OUT LRECL=255
BLKSIZE=5100 FMT=FB 'NEXT 0030'
0030 ON 9TP *TAPE4* VOL=FC0030 RING=OUT LRECL=255
BLKSIZE=5100 FMT=FB
$$ENDFILE
$$RUN -LOAD# 1=*TAPE4* 2=KWORD 3=-CLASS 4=-OUT 8=*TAPE3*
9=-DOCU 7=-NUMBER


i L7 EO 2 Orr si le view c O.4470E 02 OS S2o Lemus 
ZO Eo 2 


$$ENDFILE
$$SIGNOFF